  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Advances in Document Layout Analysis

Bosch Campos, Vicente 05 March 2020
Handwritten Text Segmentation (HTS) is a task within the Document Layout Analysis field that aims to detect and extract the different page regions of interest found in handwritten documents. HTS remains an active research topic that has gained importance over the years, owing to the increasing demand for textual access to the many handwritten document collections held by archives and libraries. This thesis considers HTS as a task that must be tackled in two specialized phases: detection and extraction. We see the detection phase fundamentally as a recognition problem that yields the vertical position of each region of interest as a by-product. The extraction phase consists of calculating the best contour coordinates of the region using the position information provided by the detection phase. Our proposed detection approach allows us to address both higher-level regions (paragraphs, diagrams, etc.) and lower-level regions such as text lines. In the case of text line detection, we model the problem to ensure that the vertical position yielded by the system approximates the fictitious line that connects the lower part of the grapheme bodies in a text line, commonly known as the baseline. One of the main contributions of this thesis is that the proposed modelling approach allows us to include prior information regarding the layout of the documents being processed. This is performed via a Vertical Layout Model (VLM). We develop a Hidden Markov Model (HMM) based framework to tackle both region detection and classification as an integrated task, and study the performance and ease of use of the proposed approach on many corpora. We review the modelling simplicity of our approach for processing regions at different levels of information: text lines, paragraphs, titles, etc. We study the impact of adding deterministic and/or probabilistic prior information and restrictions via the VLM that our approach provides.
Having a separate phase that accurately yields the detection position of each region (baselines in the case of text lines) greatly simplifies the problem that must be tackled during the extraction phase. In this thesis we propose to use a distance map that takes into consideration the grey-scale information in the image. This allows us to yield extraction frontiers which are equidistant to the adjacent text regions. We study how the accuracy of our approach scales in proportion to the quality of the provided detection vertical position. Our extraction approach gives near-perfect results when human-reviewed baselines are provided.
Bosch Campos, V. (2020). Advances in Document Layout Analysis [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/138397
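As a rough illustration of how an HMM can yield vertical line positions as a by-product of decoding, the sketch below runs Viterbi over a toy two-state model ("gap" and "line") on binarised page rows. Everything in it is an assumption made for illustration: the two states, the probabilities, and the 0/1 row features are hypothetical and are not taken from the thesis, whose actual features and Vertical Layout Model are not described in the abstract.

```python
import math

# Hypothetical two-state model: each page row is either inter-line gap or text line.
STATES = ("gap", "line")
START = {"gap": 0.5, "line": 0.5}
# Assumed transition probabilities: regions tend to persist from row to row.
TRANS = {"gap": {"gap": 0.8, "line": 0.2}, "line": {"gap": 0.2, "line": 0.8}}
# Assumed emission probabilities for a binarised row feature (1 = row contains ink).
EMIT = {"gap": {0: 0.9, 1: 0.1}, "line": {0: 0.1, 1: 0.9}}


def viterbi(rows):
    """Return the most likely state sequence for a list of 0/1 row features."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rows[0]]) for s in STATES}
    back = []
    for obs in rows[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def baselines(path):
    """Take the last row of each 'line' run as a crude baseline position."""
    return [i for i, s in enumerate(path)
            if s == "line" and (i + 1 == len(path) or path[i + 1] != "line")]


rows = [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
path = viterbi(rows)
print(path)
print(baselines(path))
```

With these assumed probabilities, the isolated empty row inside the first text line is absorbed into the line rather than splitting it, which is the kind of smoothing a layout prior is meant to provide.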
12

Deep Learning Methodologies for Textual and Graphical Content-Based Analysis of Handwritten Text Images

Prieto Fontcuberta, José Ramón 08 July 2024
This thesis addresses unresolved problems in the field of Artificial Intelligence as applied to historical handwritten documents. The challenges include not only the degradation of the documents but also the scarcity of data available for training specialized models. This limitation is particularly relevant at a time when the trend is to use large datasets and massive models to achieve significant breakthroughs. First, we provide an overview of various techniques and concepts used throughout the thesis. Different ways of representing data are explored, including images, text, and graphs. Probabilistic Indices (PrIx) are introduced for textual representation, and their encoding using TfIdf is explained. We also discuss selecting the best input features for neural networks using Information Gain (IG). In the realm of neural networks, specific models such as the Multilayer Perceptron (MLP), Convolutional Neural Networks (CNNs), and graph-based networks (GNNs) are covered, along with a brief introduction to transformers.
The first problem addressed in this thesis is the segmentation of historical handwritten books into semantic units, a complex and recurring challenge in archives worldwide. Unlike modern books, where chapter segmentation is relatively straightforward, historical books present unique challenges due to their irregularities and potentially poor preservation. To the best of our knowledge, this thesis is the first to formally define this problem. We propose a pipeline to consistently extract these semantic units in two variations: one with corpus-specific constraints and another without them. Various types of neural networks are employed, including CNNs for classifying different parts of the image, and Region Proposal Networks (RPNs) and transformers for detecting and classifying regions. Additionally, a new metric is introduced to measure the information loss in the detection, alignment, and transcription of these semantic units. Finally, different decoding methods are compared, and the results are evaluated on up to five different datasets.
In another chapter, we tackle the challenge of classifying non-transcribed historical handwritten documents, specifically notarial deeds from the Provincial Historical Archive of Cádiz. A framework is developed that employs Probabilistic Indices (PrIx) for classifying these documents, and it is compared with 1-best transcriptions obtained through Handwritten Text Recognition (HTR) techniques. In addition to conventional classification within a closed set of classes (Close Set Classification, CSC), this thesis introduces the Open Set Classification (OSC) framework. This approach not only classifies documents into predefined classes but also identifies those that do not belong to any of the established classes, allowing an expert to label them. Various techniques are compared, and two are proposed: one that does not use a threshold on the posterior probabilities generated by the neural network model, and another that applies a threshold to these probabilities, with the option of adjusting it manually according to the expert's needs.
In a third chapter, this thesis focuses on Information Extraction (IE) from handwritten tabular documents. A pipeline is developed that starts by detecting text in images containing tables, line by line, followed by its transcription using HTR techniques. In parallel, various models are trained to identify the structure of the tables, including rows, columns, and header sections. The pipeline also addresses common issues in handwritten tables, such as multi-span columns and the substitution of ditto marks. Additionally, a language model specifically trained to detect table headers automatically is employed. Two datasets are used to demonstrate the effectiveness of the pipeline in the IE task, and areas for improvement within the pipeline itself are identified for future research.
Prieto Fontcuberta, JR. (2024). Deep Learning Methodologies for Textual and Graphical Content-Based Analysis of Handwritten Text Images [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/206075
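The thresholded variant of open set classification mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the class names, the `REJECT` marker, and the default threshold of 0.7 are all hypothetical.

```python
REJECT = "REJECT"  # hypothetical marker for documents left to a human expert


def classify_open_set(posteriors, threshold=0.7):
    """Pick the most probable class, or reject if its posterior is below the threshold.

    `posteriors` maps class name -> posterior probability from some classifier.
    The threshold value is an assumption; an expert could raise or lower it.
    """
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < threshold:
        return REJECT
    return best


# Hypothetical notarial-deed classes and posteriors, for illustration only.
confident = {"will": 0.91, "sale": 0.06, "power_of_attorney": 0.03}
uncertain = {"will": 0.40, "sale": 0.35, "power_of_attorney": 0.25}

print(classify_open_set(confident))
print(classify_open_set(uncertain))
print(classify_open_set(uncertain, threshold=0.3))
```

Lowering the threshold accepts more documents automatically at the cost of more misclassified out-of-set documents; raising it sends more documents to the expert.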
13

Analysis Of Multi-lingual Documents With Complex Layout And Content

Pati, Peeta Basa 11 1900
A document image, besides text, may contain pictures, graphs, signatures, logos, barcodes, hand-drawn sketches and/or seals. Further, the text blocks in an image may be in Manhattan or any complex layout. Document Layout Analysis is an important preprocessing step before subjecting any such image to OCR. Here, the image with complex layout and content is segmented into its constituent components. For many present-day applications, separating the text from the non-text blocks is sufficient. This enables the conversion of the text elements present in the image to their corresponding editable form. In this work, an effort has been made to separate the text areas from the various kinds of possible non-text elements. The document images may have been obtained from a scanner or camera. If the source is a scanner, there is control over the scanning resolution and the lighting of the paper surface. Moreover, during the scanning process, the paper surface remains parallel to the sensor surface. However, when an image is obtained through a camera, these advantages are no longer available. Here, an algorithm is proposed to separate the text present in an image from the clutter, irrespective of the imaging technology used. This is achieved by using both the structural and textural information of the text present in the gray image. A bank of Gabor filters characterizes the statistical distribution of the text elements in the document. A connected-component-based technique removes certain types of non-text elements from the image. When a camera is used to acquire document images, generally, along with the structural and textural information of the text, color information is also obtained. It can be assumed that text present in an image has a certain amount of color homogeneity. So, a graph-theoretical color clustering scheme is employed to segment the iso-color components of the image. Each iso-color image is then analyzed separately for its structural and textural properties. 
The results of such analyses are merged with the information obtained from the gray component of the image. This helps to separate the colored text areas from the non-text elements. The proposed scheme is computationally intensive, because the separation of the text from non-text entities is performed at the pixel level. Since any entity is represented by a connected set of pixels, it makes more sense to carry out the separation only at specific points, selected as representatives of their neighborhood. Harris' operator evaluates an edge measure at each pixel and selects pixels that are locally rich in this measure. These points are then employed for separating text from non-text elements. Many government documents and forms in India are bi-lingual or tri-lingual in nature. Further, in school textbooks, it is common to find English words interspersed within sentences in the main Indian language of the book. In such documents, successive words in a line of text may be of different scripts (languages). Hence, for OCR of these documents, the script must be recognized at the level of words, rather than lines or paragraphs. A database of about 20,000 words each from 11 Indian scripts is created. This is so far the largest database of Indian words collected and deployed for script recognition purposes. Here again, a bank of 36 Gabor filters is used to extract the feature vector which represents the script of the word. The effectiveness of Gabor features is compared with that of DCT, and it is found that Gabor features marginally outperform the DCT. Simple, linear and non-linear classifiers are employed to classify the word in the feature space. It is assumed that a scheme developed to recognize the script of the words would work equally well for sentences and paragraphs. This assumption has been verified with supporting results. 
A systematic study has been conducted to evaluate and compare the accuracy of various feature-classifier combinations for word script recognition. We have considered the cases of bi-script and tri-script documents, which are widely available. Average recognition accuracies for bi-script and tri-script cases are 98.4% and 98.2%, respectively. A hierarchical blind script recognizer involving all eleven scripts has been developed and evaluated; it yields an average accuracy of 94.1%. The major contributions of the thesis are:
• A graph-theoretic color clustering scheme is used to segment colored text.
• A scheme is proposed to separate text from the non-text content of documents with complex layout and content, captured by scanner or camera.
• Computational complexity is reduced by performing the separation task on a selected set of locally edge-rich points.
• Script identification at word level is carried out using different feature-classifier combinations. Gabor features with an SVM classifier outperform all other feature-classifier combinations.
A hierarchical blind script recognition algorithm, involving the recognition of 11 Indian scripts, is developed. This structure employs the most efficient feature-classifier combination at each individual nodal point of the tree to maximize the system performance. A sequential forward feature selection algorithm is employed to select the most discriminating features, on a case-by-case basis, for script recognition.
The 11 scripts are Bengali, Devanagari, Gujarati, Kannada, Malayalam, Odiya, Punjabi, Roman, Tamil, Telugu and Urdu.
14

Fully Convolutional Neural Networks for Pixel Classification in Historical Document Images

Stewart, Seth Andrew 01 October 2018
We use a Fully Convolutional Neural Network (FCNN) to classify pixels in historical document images, enabling the extraction of high-quality, pixel-precise and semantically consistent layers of masked content. We also analyze a dataset of hand-labeled historical form images of unprecedented detail and complexity. The semantic categories we consider in this new dataset include handwriting, machine-printed text, dotted and solid lines, and stamps. Segmentation of document images into distinct layers allows handwriting, machine print, and other content to be processed and recognized discriminatively, and therefore more intelligently than might be possible with content-unaware methods. We show that an efficient FCNN with relatively few parameters can accurately segment documents having similar textural content when trained on a single representative pixel-labeled document image, even when layouts differ significantly. In contrast to the overwhelming majority of existing semantic segmentation approaches, we allow multiple labels to be predicted per pixel location, which allows for direct prediction and reconstruction of overlapped content. We perform an analysis of prevalent pixel-wise performance measures, and show that several popular performance measures can be manipulated adversarially, yielding arbitrarily high measures based on the type of bias used to generate the ground-truth. We propose a solution to the gaming problem by comparing absolute performance to an estimated human level of performance. We also present results on a recent international competition requiring the automatic annotation of billions of pixels, in which our method took first place.
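To make the difference between single-label and multi-label pixel prediction concrete, here is a small sketch. It assumes one independent sigmoid output per category with a 0.5 cut-off, which is a common way to allow several labels per pixel; the abstract does not state how the thesis implements this, and the logit values are invented for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pixel_labels(logits, threshold=0.5):
    """Return every category whose independent probability reaches the threshold.

    `logits` maps category name -> raw network score for one pixel.
    Unlike an argmax over classes, this can return zero, one, or several labels.
    """
    return {name for name, z in logits.items() if sigmoid(z) >= threshold}


# Hypothetical scores for one pixel where handwriting crosses a printed form line.
overlap_pixel = {"handwriting": 2.0, "machine_print": -3.0, "line": 1.5, "stamp": -4.0}

print(sorted(pixel_labels(overlap_pixel)))
print(max(overlap_pixel, key=overlap_pixel.get))  # single-label argmax, for contrast
```

The argmax keeps only the strongest category, whereas the per-category threshold keeps both overlapping layers, which is what makes reconstruction of overlapped content possible.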
16

[en] GENERATION AND DETECTION OF OBJECTS IN DOCUMENTS BY DEEP LEARNING NEURAL NETWORK MODELS (DEEPDOCGEN) / [pt] GERAÇÃO E DETECÇÃO DE OBJETOS EM DOCUMENTOS POR MODELOS DE REDES NEURAIS DE APRENDIZAGEM PROFUNDA (DEEPDOCGEN)

LOICK GEOFFREY HODONOU 06 February 2025
[en] The effectiveness of human-machine conversation systems, such as chatbots and virtual assistants, is directly related to the amount and quality of knowledge available to them. In the digital age, the diversity and quality of data have increased significantly, and data are now available in various formats. Among these, PDF (Portable Document Format) stands out as one of the best known and most widely used, serving sectors such as business, education, and research. These files contain a considerable amount of structured data, such as text, headings, lists, tables, and images. The content of PDF files can be extracted using dedicated tools, such as OCR (Optical Character Recognition), PdfMiner, Tabula and others, which have proven to be suitable for this task. However, these tools may encounter difficulties when dealing with the complex and varied presentation of PDF documents. The accuracy of extraction can be compromised by the diversity of layouts, non-standardized formats, and embedded graphic elements in the documents, often leading to manual post-processing. Computer vision, and more specifically object detection, is a branch of machine learning that aims to locate and classify instances in images using models dedicated to the task. It is proving to be a viable approach to accelerating the work performed by tools such as OCR, PdfMiner, and Tabula, and to improving their accuracy. Object detection models, being based on deep learning, require not only a substantial amount of training data but, above all, high-quality annotations, as these have a direct impact on achieving high levels of accuracy and robustness. The diversity of layouts and graphic elements in PDF documents adds a further layer of complexity, requiring representatively annotated data so that the models can learn to handle all possible variations. 
Given the volume of data needed to train the models, data annotation quickly becomes a tedious and time-consuming task that requires human intervention to identify and label each relevant element manually. The task is also subject to human error, often requiring additional checks and corrections. To find a middle ground between the amount of data, annotation time, and annotation quality, in this work we propose a pipeline that takes as input a limited number of PDF documents annotated with the categories text, title, list, table, and image, and creates as many similar new document layouts as the user requests. The pipeline then fills the newly created layouts with content to provide synthetic document images and their respective annotations. With its simple, intuitive, and scalable structure, this pipeline can contribute to active learning, allowing detection models to be trained continuously and making them more effective and robust on real documents. In our experiments, when evaluating and comparing three detection models, we observed that RT-DETR (Real-Time DEtection TRansformer) achieved the best results, reaching a mean Average Precision (mAP) of 96.30 percent and surpassing Mask R-CNN (Region-based Convolutional Neural Networks) and Mask DINO (Mask DETR with Improved Denoising Anchor Boxes). The superiority of RT-DETR indicates its potential to become a reference solution for detecting features in PDF documents. These promising results pave the way for more efficient and reliable applications in automatic document processing.
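As background for the mAP figure quoted above: detection metrics of this kind usually rest on matching predicted boxes to annotated boxes by intersection over union (IoU). The sketch below shows only that building block. The box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for illustration; the abstract does not state which IoU thresholds or evaluation code the author used.

```python
from typing import Tuple

# Assumed box format (not taken from the thesis): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as having no overlap.
    return inter / union if union > 0 else 0.0
```

A predicted region such as a table or title would typically count as correct only when its IoU with an annotated region of the same category exceeds a chosen threshold; average precision is then computed per category and averaged.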
17

Layout Analysis for Handwritten Documents. A Probabilistic Machine Learning Approach

Quirós Díaz, Lorenzo 21 March 2022
[EN] Document Layout Analysis, applied to handwritten documents, aims to automatically obtain the intrinsic structure of a document. Its development as a research field spans from the character segmentation systems developed in the early 1960s to the complex systems designed nowadays, where the goal is to analyze high-level structures (lines of text, paragraphs, tables, etc.) and the relationships between them. This thesis first defines the goal of Document Layout Analysis from a probabilistic perspective. Then, so that it can be handled by modern computing resources, the problem is reduced to a set of well-known complementary subproblems. More precisely, three of the main subproblems of Document Layout Analysis are addressed following a probabilistic formulation, namely Baseline Detection, Region Segmentation and Reading Order Determination. One of the main contributions of this thesis is the formalization of the Baseline Detection and Region Segmentation problems under a probabilistic framework, where both problems can be handled separately or in an integrated way by the proposed models. The latter approach has proven very useful for handling large document collections with limited computing resources. Later, the Reading Order Determination subproblem is addressed. It is one of the most important, yet underestimated, subproblems of Document Layout Analysis, since it is the bridge that allows us to convert the data extracted from Automatic Text Recognition systems into useful information. Therefore, Reading Order Determination is addressed and formalized as a pairwise probabilistic sorting problem. 
Moreover, we propose two different decoding algorithms that reduce the computational complexity of the problem. Furthermore, different statistical models are used to represent the probability distribution over the structure of the documents. These models, based on Artificial Neural Networks (from a simple Multilayer Perceptron to complex Convolutional and Region Proposal Networks), are estimated from training data using supervised Machine Learning algorithms. Finally, all the contributions are experimentally evaluated, not only on standard academic benchmarks but also on collections of thousands of images. We consider handwritten text documents and handwritten musical documents, as together they represent the majority of documents in libraries and archives. The results show that the proposed methods are very accurate and versatile across a wide range of handwritten documents. / Quirós Díaz, L. (2022). Layout Analysis for Handwritten Documents. A Probabilistic Machine Learning Approach [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/181483
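To illustrate what treating reading order as a pairwise probabilistic sorting problem can look like, here is a minimal sketch. It assumes a matrix `p` in which `p[i][j]` estimates the probability that layout element `i` is read before element `j`, and it ranks elements by their summed "precedes" scores. This is a generic heuristic chosen for illustration; it is not one of the two decoding algorithms proposed in the thesis, which the abstract does not describe, and the name `decode_reading_order` is hypothetical.

```python
from typing import List, Sequence


def decode_reading_order(p: Sequence[Sequence[float]]) -> List[int]:
    """Order elements from pairwise precedence estimates.

    p[i][j] is an assumed estimate of P(element i is read before element j).
    Diagonal entries are ignored.
    """
    n = len(p)
    # Score each element by how strongly it is believed to precede the others.
    scores = [sum(p[i][j] for j in range(n) if j != i) for i in range(n)]
    # Elements with higher scores come earlier; ties keep their index order.
    return sorted(range(n), key=lambda i: -scores[i])
```

In practice the pairwise estimates would come from a trained classifier over pairs of regions or text lines. Scoring by row sums needs only one pass over the matrix, which is one simple way to avoid searching over all permutations.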
