1 |
Chart Detection and Recognition in Graphics Intensive Business DocumentsSvendsen, Jeremy Paul 24 December 2015 (has links)
Document image analysis involves the recognition and understanding of document images using computer vision techniques. The research described in this thesis relates to the recognition of graphical elements of a document image. More specifically, an approach for recognizing various types of charts as well as their components is presented. This research has many potential applications. For example, a user could redraw a chart in a different style or convert the chart to a table, without possessing the original information that was used to create the chart. Another application is the ability to find information, which is only presented in the chart, using a search engine.
A complete solution to chart image recognition and understanding is presented. The proposed algorithm extracts enough information such that the chart can be recreated. The method is a syntactic approach which uses mathematical grammars to recognize and classify every component of a chart. There are two grammars presented in this thesis, one which analyzes 2D and 3D pie charts and the other which analyzes 2D and 3D bar charts, as well as line charts. The pie chart grammar isolates each slice and its properties whereas the bar and line chart grammar recognizes the bars, indices, gridlines and polylines.
The method is evaluated in two ways. A qualitative approach redraws the chart for the user, and a semi-automated quantitative approach provides a complete analysis of the accuracy of the proposed method. The qualitative analysis allows the user to see exactly what has been classified correctly. The quantitative analysis gives more detailed information about the strengths and weaknesses of the proposed method. The results of the evaluation process show that the accuracy of the proposed methods for chart recognition is very high. / Graduate
|
2 |
Issues in Performance Evaluation of Mathematical Notation Recognition SystemsLapointe, Adrien 28 May 2008 (has links)
Performance evaluation of document recognition systems is a difficult and practically-important problem. In this thesis, we contribute to the understanding of performance evaluation by studying some issues that arise in evaluation of systems for recognition of mathematical expressions. Issues that are discussed cover the reported performance evaluation experiments, the code availability, the nature of the mathematical notation, the extent of the coverage of mathematical recognition systems, and the quantification of performance evaluation results. For each issue, we discuss its impact on performance evaluation, give an overview of the state of the art for addressing it and point out open problems. / Thesis (Master, Computing) -- Queen's University, 2008-05-21 15:34:21.966
|
3 |
Reconnaissance et classification d’images de documents / Document image retrieval and classificationAugereau, Olivier 14 February 2013 (has links)
Ces travaux de recherche ont pour ambition de contribuer à la problématique de la classification d’images de documents. Plus précisément, ces travaux tendent à répondre aux problèmes rencontrés par des sociétés de numérisation dont l’objectif est de mettre à disposition de leurs clients une version numérique des documents papiers accompagnés d’informations qui leurs sont relatives. Face à la diversité des documents à numériser, l’extraction d’informations peut s’avérer parfois complexe. C’est pourquoi la classification et l’indexation des documents sont très souvent réalisées manuellement. Ces travaux de recherche ont permis de fournir différentes solutions en fonction des connaissances relatives aux images que possède l’utilisateur ayant en charge l’annotation des documents.Le premier apport de cette thèse est la mise en place d’une méthode permettant, de manière interactive, à un utilisateur de classer des images de documents dont la nature est inconnue. Le second apport de ces travaux est la proposition d’une technique de recherche d’images de documents par l’exemple basée sur l’extraction et la mise en correspondance de points d’intérêts. Le dernier apport de cette thèse est l’élaboration d’une méthode de classification d’images de documents utilisant les techniques de sacs de mots visuels. / The aim of this research is to contribute to the document image classification problem. More specifically, these studies address digitizing company issues which objective is to provide the digital version of paper document with information relating to them. Given the diversity of documents, information extraction can be complex. This is why the classification and the indexing of documents are often performed manually. This research provides several solutions based on knowledge of the images that the user has. The first contribution of this thesis is a method for classifying interactively document images, where the content of documents and classes are unknown. The second contribution of this work is a new technique for document image retrieval by giving one example of researched document. This technique is based on the extraction and matching of interest points. The last contribution of this thesis is a method for classifying document images by using bags of visual words techniques.
|
4 |
A User Centered Design and Prototype of a Mobile Reading Device for the Visually ImpairedKeefer, Robert B. 10 June 2011 (has links)
No description available.
|
5 |
Multi-Oriented and multi-scaled text character analysis and recognition in graphical documents and their apllications to document image retrievalPratim Roy, Partha 03 November 2010 (has links)
With the advent research of Document Image Analysis and Recognition (DIAR), an
important line of research is explored on indexing and retrieval of graphics rich docu-
ments. It aims at nding relevant documents relying on segmentation and recognition
of text and graphics components underlying in non-standard layout where commercial
OCRs can not be applied due to complexity. This thesis is focused towards text infor-
mation extraction approaches in graphical documents and retrieval of such documents
using text information.
Automatic text recognition in graphical documents (map, engineering drawing,
etc.) involves many challenges because text characters are usually printed in multi-
oriented and multi-scale way along with di erent graphical objects. Text characters
are used to annotate the graphical curve lines and hence, many times they follow
curvi-linear paths too. For OCR of such documents, individual text lines and their
corresponding words/characters need to be extracted.
For recognition of multi-font, multi-scale and multi-oriented characters, we have
proposed a feature descriptor for character shape using angular information from con-
tour pixels to take care of the invariance nature. To improve the e ciency of OCR, an
approach towards the segmentation of multi-oriented touching strings into individual
characters is also discussed. Convex hull based background information is used to
segment a touching string into possible primitive segments and later these primitive
segments are merged to get optimum segmentation using dynamic programming. To
overcome the touching/overlapping problem of text with graphical lines, a character
spotting approach using SIFT and skeleton information is included. Afterwards, we
propose a novel method to extract individual curvi-linear text lines using the fore-
ground and background information of the characters of the text and a water reservoir
concept is used to utilize the background information.
We have also formulated the methodologies for graphical document retrieval ap-
plications using query words and seals. The retrieval approaches are performed using
recognition results of individual components in the document. Given a query text,
the system extracts positional knowledge from the query word and uses the same to
generate hypothetical locations in the document. Indexing of documents is also per-
formed based on automatic detection of seals from documents containing cluttered
background. A seal is characterized by scale and rotation invariant spatial feature
descriptors computed from labelled text characters and a concept based on the Generalized Hough Transform is used to locate the seal in documents.
Keywords: Document Image Analysis, Graphics Recognition, Dynamic Pro-
gramming, Generalized Hough Transform, Character Recognition, Touching Charac-
ter Segmentation, Text/Graphics Separation, Curve-Line Separation, Word Retrieval,
Seal Detection and Recognition.
|
6 |
Modèle de dégradation d’images de documents anciens pour la génération de données semi-synthétiques / Semi-synthetic ancient document image generation by using document degradation modelsKieu, Van Cuong 25 November 2014 (has links)
Le nombre important de campagnes de numérisation mises en place ces deux dernières décennies a entraîné une effervescence scientifique ayant mené à la création de nombreuses méthodes pour traiter et/ou analyser ces images de documents (reconnaissance d’écriture, analyse de la structure de documents, détection/indexation et recherche d’éléments graphiques, etc.). Un bon nombre de ces approches est basé sur un apprentissage (supervisé, semi supervisé ou non supervisé). Afin de pouvoir entraîner les algorithmes correspondants et en comparer les performances, la communauté scientifique a un fort besoin de bases publiques d’images de documents avec la vérité-terrain correspondante, et suffisamment exhaustive pour contenir des exemples représentatifs du contenu des documents à traiter ou analyser. La constitution de bases d’images de documents réels nécessite d’annoter les données (constituer la vérité terrain). Les performances des approches récentes d’annotation automatique étant très liées à la qualité et à l’exhaustivité des données d’apprentissage, ce processus d’annotation reste très largement manuel. Ce processus peut s’avérer complexe, subjectif et fastidieux. Afin de tenter de pallier à ces difficultés, plusieurs initiatives de crowdsourcing ont vu le jour ces dernières années, certaines sous la forme de jeux pour les rendre plus attractives. Si ce type d’initiatives permet effectivement de réduire le coût et la subjectivité des annotations, reste un certain nombre de difficultés techniques difficiles à résoudre de manière complètement automatique, par exemple l’alignement de la transcription et des lignes de texte automatiquement extraites des images. Une alternative à la création systématique de bases d’images de documents étiquetées manuellement a été imaginée dès le début des années 90. Cette alternative consiste à générer des images semi-synthétiques imitant les images réelles. La génération d’images de documents semi-synthétiques permet de constituer rapidement un volume de données important et varié, répondant ainsi aux besoins de la communauté pour l’apprentissage et l’évaluation de performances de leurs algorithmes. Dans la cadre du projet DIGIDOC (Document Image diGitisation with Interactive DescriptiOn Capability) financé par l’ANR (Agence Nationale de la Recherche), nous avons mené des travaux de recherche relatifs à la génération d’images de documents anciens semi-synthétiques. Le premier apport majeur de nos travaux réside dans la création de plusieurs modèles de dégradation permettant de reproduire de manière synthétique des déformations couramment rencontrées dans les images de documents anciens (dégradation de l’encre, déformation du papier, apparition de la transparence, etc.). Le second apport majeur de ces travaux de recherche est la mise en place de plusieurs bases d’images semi-synthétiques utilisées dans des campagnes de test (compétition ICDAR2013, GREC2013) ou pour améliorer par ré-apprentissage les résultats de méthodes de reconnaissance de caractères, de segmentation ou de binarisation. Ces travaux ont abouti sur plusieurs collaborations nationales et internationales, qui se sont soldées en particulier par plusieurs publications communes. Notre but est de valider de manière la plus objective possible, et en collaboration avec la communauté scientifique concernée, l’intérêt des images de documents anciens semi-synthétiques générées pour l’évaluation de performances et le ré-apprentissage. / In the last two decades, the increase in document image digitization projects results in scientific effervescence for conceiving document image processing and analysis algorithms (handwritten recognition, structure document analysis, spotting and indexing / retrieval graphical elements, etc.). A number of successful algorithms are based on learning (supervised, semi-supervised or unsupervised). In order to train such algorithms and to compare their performances, the scientific community on document image analysis needs many publicly available annotated document image databases. Their contents must be exhaustive enough to be representative of the possible variations in the documents to process / analyze. To create real document image databases, one needs an automatic or a manual annotation process. The performance of an automatic annotation process is proportional to the quality and completeness of these databases, and therefore annotation remains largely manual. Regarding the manual process, it is complicated, subjective, and tedious. To overcome such difficulties, several crowd-sourcing initiatives have been proposed, and some of them being modelled as a game to be more attractive. Such processes reduce significantly the price andsubjectivity of annotation, but difficulties still exist. For example, transcription and textline alignment have to be carried out manually. Since the 1990s, alternative document image generation approaches have been proposed including in generating semi-synthetic document images mimicking real ones. Semi-synthetic document image generation allows creating rapidly and cheaply benchmarking databases for evaluating the performances and trainingdocument processing and analysis algorithms. In the context of the project DIGIDOC (Document Image diGitisation with Interactive DescriptiOn Capability) funded by ANR (Agence Nationale de la Recherche), we focus on semi-synthetic document image generation adapted to ancient documents. First, we investigate new degradation models or adapt existing degradation models to ancient documents such as bleed-through model, distortion model, character degradation model, etc. Second, we apply such degradation models to generate semi-synthetic document image databases for performance evaluation (e.g the competition ICDAR2013, GREC2013) or for performance improvement (by re-training a handwritten recognition system, a segmentation system, and a binarisation system). This research work raises many collaboration opportunities with other researchers to share our experimental results with our scientific community. This collaborative work also helps us to validate our degradation models and to prove the efficiency of semi-synthetic document images for performance evaluation and re-training.
|
7 |
Neural Networks for Document Image and Text ProcessingPastor Pellicer, Joan 03 November 2017 (has links)
Nowadays, the main libraries and document archives are investing a considerable effort on digitizing their collections. Indeed, most of them are scanning the documents and publishing the resulting images without their corresponding transcriptions. This seriously limits the document exploitation possibilities. When the transcription is necessary, it is manually performed by human experts, which is a very expensive and error-prone task. Obtaining transcriptions to the level of required quality demands the intervention of human experts to review and correct the resulting output of the recognition engines. To this end, it is extremely useful to provide interactive tools to obtain and edit the transcription.
Although text recognition is the final goal, several previous steps (known as preprocessing) are necessary in order to get a fine transcription from a digitized image. Document cleaning, enhancement, and binarization (if they are needed) are the first stages of the recognition pipeline. Historical Handwritten Documents, in addition, show several degradations, stains, ink-trough and other artifacts. Therefore, more sophisticated and elaborate methods are required when dealing with these kind of documents, even expert supervision in some cases is needed. Once images have been cleaned, main zones of the image have to be detected: those that contain text and other parts such as images, decorations, versal letters. Moreover, the relations among them and the final text have to be detected. Those preprocessing steps are critical for the final performance of the system since an error at this point will be propagated during the rest of the transcription process.
The ultimate goal of the Document Image Analysis pipeline is to receive the transcription of the text (Optical Character Recognition and Handwritten Text Recognition). During this thesis we aimed to improve the main stages of the recognition pipeline, from the scanned documents as input to the final transcription. We focused our effort on applying Neural Networks and deep learning techniques directly on the document images to extract suitable features that will be used by the different tasks dealt during the following work: Image Cleaning and Enhancement (Document Image Binarization), Layout Extraction, Text Line Extraction, Text Line Normalization and finally decoding (or text line recognition). As one can see, the following work focuses on small improvements through the several Document Image Analysis stages, but also deals with some of the real challenges: historical manuscripts and documents without clear layouts or very degraded documents.
Neural Networks are a central topic for the whole work collected in this document.
Different convolutional models have been applied for document image cleaning and enhancement. Connectionist models have been used, as well, for text line extraction:
first, for detecting interest points and combining them in text segments and, finally, extracting the lines by means of aggregation techniques; and second, for pixel labeling to extract the main body area of the text and then the limits of the lines. For text line preprocessing, i.e., to normalize the text lines before recognizing them, similar models have been used to detect the main body area and then to height-normalize the images giving more importance to the central area of the text. Finally, Convolutional Neural Networks and deep multilayer perceptrons have been combined with hidden Markov models to improve our transcription engine significantly.
The suitability of all these approaches has been tested with different corpora for any of the stages dealt, giving competitive results for most of the methodologies presented. / Hoy en día, las principales librerías y archivos está invirtiendo un esfuerzo considerable en la digitalización de sus colecciones. De hecho, la mayoría están escaneando estos documentos y publicando únicamente las imágenes sin transcripciones, limitando seriamente la posibilidad de explotar estos documentos. Cuando la transcripción es necesaria, esta se realiza normalmente por expertos de forma manual, lo cual es una tarea costosa y propensa a errores. Si se utilizan sistemas de reconocimiento automático se necesita la intervención de expertos humanos para revisar y corregir la salida de estos motores de reconocimiento.
Por ello, es extremadamente útil para proporcionar herramientas interactivas con el fin de generar y corregir la transcripciones.
Aunque el reconocimiento de texto es el objetivo final del Análisis de Documentos, varios pasos previos (preprocesamiento) son necesarios para conseguir una buena transcripción a partir de una imagen digitalizada. La limpieza, mejora y binarización de las imágenes son las primeras etapas del proceso de reconocimiento. Además, los manuscritos históricos tienen una mayor dificultad en el preprocesamiento, puesto que pueden mostrar varios tipos de degradaciones, manchas, tinta a través del papel y demás dificultades. Por lo tanto, este tipo de documentos requiere métodos de preprocesamiento más sofisticados. En algunos casos, incluso, se precisa de la supervisión de expertos para garantizar buenos resultados en esta etapa. Una vez que las imágenes han sido limpiadas, las diferentes zonas de la imagen deben de ser localizadas: texto, gráficos, dibujos, decoraciones, letras versales, etc. Por otra parte, también es importante conocer las relaciones entre estas entidades. Estas etapas del pre-procesamiento son críticas para el rendimiento final del sistema, ya que los errores cometidos en aquí se propagarán al resto del proceso de transcripción.
El objetivo principal del trabajo presentado en este documento es mejorar las principales etapas del proceso de reconocimiento completo: desde las imágenes escaneadas hasta la transcripción final. Nuestros esfuerzos se centran en aplicar técnicas de Redes Neuronales (ANNs) y aprendizaje profundo directamente sobre las imágenes de los documentos, con la intención de extraer características adecuadas para las diferentes tareas: Limpieza y Mejora de Documentos, Extracción de Líneas, Normalización de Líneas de Texto y, finalmente, transcripción del texto. Como se puede apreciar, el trabajo se centra en pequeñas mejoras en diferentes etapas del Análisis y Procesamiento de Documentos, pero también trata de abordar tareas más complejas: manuscritos históricos, o documentos que presentan degradaciones.
Las ANNs y el aprendizaje profundo son uno de los temas centrales de esta tesis.
Diferentes modelos neuronales convolucionales se han desarrollado para la limpieza y mejora de imágenes de documentos. También se han utilizado modelos conexionistas para la extracción de líneas: primero, para detectar puntos de interés y segmentos de texto y, agregarlos para extraer las líneas del documento; y en segundo lugar, etiquetando directamente los píxeles de la imagen para extraer la zona central del texto y así definir los límites de las líneas. Para el preproceso de las líneas de texto, es decir, la normalización del texto antes del reconocimiento final, se han utilizado modelos similares a los mencionados para detectar la zona central del texto. Las imagenes se rescalan a una altura fija dando más importancia a esta zona central. Por último, en cuanto a reconocimiento de escritura manuscrita, se han combinado técnicas de ANNs y aprendizaje profundo con Modelos Ocultos de Markov, mejorando significativamente los resultados obtenidos previamente por nuestro motor de reconocimiento.
La idoneidad de todos estos enfoques han sido testeados con diferentes corpus en cada una de las tareas tratadas., obtenie / Avui en dia, les principals llibreries i arxius històrics estan invertint un esforç considerable en la digitalització de les seues col·leccions de documents. De fet, la majoria estan escanejant aquests documents i publicant únicament les imatges sense les seues transcripcions, fet que limita seriosament la possibilitat d'explotació d'aquests documents. Quan la transcripció del text és necessària, normalment aquesta és realitzada per experts de forma manual, la qual cosa és una tasca costosa i pot provocar errors. Si s'utilitzen sistemes de reconeixement automàtic es necessita la intervenció d'experts humans per a revisar i corregir l'eixida d'aquests motors de reconeixement. Per aquest motiu, és extremadament útil proporcionar eines interactives amb la finalitat de generar i corregir les transcripcions generades pels motors de reconeixement.
Tot i que el reconeixement del text és l'objectiu final de l'Anàlisi de Documents, diversos passos previs (coneguts com preprocessament) són necessaris per a l'obtenció de transcripcions acurades a partir d'imatges digitalitzades.
La neteja, millora i binarització de les imatges (si calen) són les primeres etapes prèvies al reconeixement. A més a més, els manuscrits històrics presenten una major dificultat d'analisi i preprocessament, perquè poden mostrar diversos tipus de degradacions, taques, tinta a través del paper i altres peculiaritats. Per tant, aquest tipus de documents requereixen mètodes de preprocessament més sofisticats. En alguns casos, fins i tot, es precisa de la supervisió d'experts per a garantir bons resultats en aquesta etapa. Una vegada que les imatges han sigut netejades, les diferents zones de la imatge han de ser localitzades: text, gràfics, dibuixos, decoracions, versals, etc. D'altra banda, també és important conéixer les relacions entre aquestes entitats i el text que contenen. Aquestes etapes del preprocessament són crítiques per al rendiment final del sistema, ja que els errors comesos en aquest moment es propagaran a la resta del procés de transcripció.
L'objectiu principal del treball que estem presentant és millorar les principals etapes del procés de reconeixement, és a dir, des de les imatges escanejades fins a l'obtenció final de la transcripció del text. Els nostres esforços se centren en aplicar tècniques de Xarxes Neuronals (ANNs) i aprenentatge profund directament sobre les imatges de documents, amb la intenció d'extraure característiques adequades per a les diferents tasques analitzades: neteja i millora de documents, extracció de línies, normalització de línies de text i, finalment, transcripció. Com es pot apreciar, el treball realitzat aplica xicotetes millores en diferents etapes de l'Anàlisi de Documents, però també tracta d'abordar tasques més complexes: manuscrits històrics, o documents que presenten degradacions.
Les ANNs i l'aprenentatge profund són un dels temes centrals d'aquesta tesi.
Diferents models neuronals convolucionals s'han desenvolupat per a la neteja i millora de les dels documents. També s'han utilitzat models connexionistes per a la tasca d'extracció de línies: primer, per a detectar punts d'interés i segments de text i, agregar-los per a extraure les línies del document; i en segon lloc, etiquetant directament els pixels de la imatge per a extraure la zona central del text i així definir els límits de les línies. Per al preprocés de les línies de text, és a dir, la normalització del text abans del reconeixement final, s'han utilitzat models similars als utilitzats per a l'extracció de línies. Finalment, quant al reconeixement d'escriptura manuscrita, s'han combinat tècniques de ANNs i aprenentatge profund amb Models Ocults de Markov, que han millorat significativament els resultats obtinguts prèviament pel nostre motor de reconeixement.
La idoneïtat de tots aquests enfocaments han sigut testejats amb diferents corpus en cadascuna de les tasques tractad / Pastor Pellicer, J. (2017). Neural Networks for Document Image and Text Processing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/90443
|
8 |
Cluster-based Sample Selection for Document Image BinarizationKrantz, Amandus January 2019 (has links)
The current state-of-the-art, in terms of performance, for solving document image binarization is training artificial neural networks on pre-labelled ground truth data. As such, it faces the same issues as other, more conventional, classification problems; requiring a large amount of training data. However, unlike those conventional classification problems, document image binarization involves having to either manually craft or estimate the binarized ground truth data, which can be error-prone and time-consuming. This is where sample selection, the act of selecting training samples based on some method or metric, might help. By reducing the size of the training dataset in such a way that the binarization performance is not impacted, the required time spent creating the ground truth is also reduced. This thesis proposes a cluster-based sample selection method, based on previous work, that uses image similarity metrics and the relative neighbourhood graph to reduce the underlying redundancy of the dataset. The method is implemented with different clustering methods and similarity metrics for comparison, with the best implementation being based on affinity propagation and the structural similarity index. This implementation manages to reduce the training dataset by 46\% while maintaining a performance that is equal to that of the complete dataset. The performance of this method is shown to not be significantly different from randomly selecting the same number of samples. However, due to limitations in the random method, such as unpredictable performance and uncertainty in how many samples to select, the use of sample selection in document image binarization still shows great promise.
|
9 |
Mathematical Expression Detection and Segmentation in Document ImagesBruce, Jacob Robert 19 March 2014 (has links)
Various document layout analysis techniques are employed in order to enhance the accuracy of optical character recognition (OCR) in document images. Type-specific document layout analysis involves localizing and segmenting specific zones in an image so that they may be recognized by specialized OCR modules. Zones of interest include titles, headers/footers, paragraphs, images, mathematical expressions, chemical equations, musical notations, tables, circuit diagrams, among others. False positive/negative detections, oversegmentations, and undersegmentations made during the detection and segmentation stage will confuse a specialized OCR system and thus may result in garbled, incoherent output. In this work a mathematical expression detection and segmentation (MEDS) module is implemented and then thoroughly evaluated. The module is fully integrated with the open source OCR software, Tesseract, and is designed to function as a component of it. Evaluation is carried out on freely available public domain images so that future and existing techniques may be objectively compared. / Master of Science
|
10 |
Segmentation of heterogeneous document images : an approach based on machine learning, connected components analysis, and texture analysisBonakdar Sakhi, Omid 06 December 2012 (has links) (PDF)
Document page segmentation is one of the most crucial steps in document image analysis. It ideally aims to explain the full structure of any document page, distinguishing text zones, graphics, photographs, halftones, figures, tables, etc. Although to date, there have been made several attempts of achieving correct page segmentation results, there are still many difficulties. The leader of the project in the framework of which this PhD work has been funded (*) uses a complete processing chain in which page segmentation mistakes are manually corrected by human operators. Aside of the costs it represents, this demands tuning of a large number of parameters; moreover, some segmentation mistakes sometimes escape the vigilance of the operators. Current automated page segmentation methods are well accepted for clean printed documents; but, they often fail to separate regions in handwritten documents when the document layout structure is loosely defined or when side notes are present inside the page. Moreover, tables and advertisements bring additional challenges for region segmentation algorithms. Our method addresses these problems. The method is divided into four parts:1. Unlike most of popular page segmentation methods, we first separate text and graphics components of the page using a boosted decision tree classifier.2. The separated text and graphics components are used among other features to separate columns of text in a two-dimensional conditional random fields framework.3. A text line detection method, based on piecewise projection profiles is then applied to detect text lines with respect to text region boundaries.4. Finally, a new paragraph detection method, which is trained on the common models of paragraphs, is applied on text lines to find paragraphs based on geometric appearance of text lines and their indentations. Our contribution over existing work lies in essence in the use, or adaptation, of algorithms borrowed from machine learning literature, to solve difficult cases. Indeed, we demonstrate a number of improvements : on separating text columns when one is situated very close to the other; on preventing the contents of a cell in a table to be merged with the contents of other adjacent cells; on preventing regions inside a frame to be merged with other text regions around, especially side notes, even when the latter are written using a font similar to that the text body. Quantitative assessment, and comparison of the performances of our method with competitive algorithms using widely acknowledged metrics and evaluation methodologies, is also provided to a large extend.(*) This PhD thesis has been funded by Conseil Général de Seine-Saint-Denis, through the FUI6 project Demat-Factory, lead by Safig SA
|
Page generated in 0.0715 seconds