1 |
Web Search Based on Hierarchical Heading-Block Structure Analysis / 階層的な見出しブロック構造の分析に基づくWeb検索
Manabe, Tomohiro 23 March 2016
The contents of Section 2.2 and Chapter 4 first appeared in the proceedings of the 12th International Conference on Web Information Systems and Technologies, 2016 (www.webist.org). The contents of Section 2.3 and Chapter 5 first appeared in DBSJ Journal, vol. 14, article no. 2, March 2016. The contents of Section 2.5 and Chapter 7 first appeared in the proceedings of the 11th Asia Information Retrieval Societies Conference, Lecture Notes in Computer Science, vol. 9460, pp. 188-200, 2015 (the final publication is available at link.springer.com). / Kyoto University / 0048 / Doctorate by coursework (new system) / Doctor of Informatics / Degree no. Kō 19854 / Informatics doctorate no. 605 / 新制||情||105 (University Library) / 32890 / Department of Social Informatics, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Keishi Tajima, Professor Katsumi Tanaka, Professor Masatoshi Yoshikawa / Conferred under Article 4, Paragraph 1 of the Degree Regulations / DFAM
|
2 |
Identificação automática de relações multidocumento / Automatic identification of multidocument relations
Maziero, Erick Galani 16 January 2012
Multi-document processing is essential in the current landscape of electronic media, in which many documents are produced on the same topic, particularly given the explosion of information enabled by the web. Both readers and computational applications benefit from multi-document discourse analysis, which makes explicit the relations among portions of the documents, for example relations of equivalence, contradiction, or background. To perform multi-document processing automatically, this work adopts the computational-linguistic theory CST (Cross-document Structure Theory, Radev, 2000). This kind of multi-document knowledge allows (i) more appropriate treatment of phenomena such as redundancy, complementarity, and contradiction of information and, consequently, (ii) better text-processing systems, such as more intelligent web search engines and automatic summarizers. This work presents a methodology for identifying these relations that explores machine learning techniques from the traditional and hierarchical paradigms. For relations not amenable to identification by machine learning (those with low frequency in the corpus), handcrafted rules were developed. Finally, a parser is built that combines the classifiers and the rules.
|
3 |
Transformation de types dans les systèmes d'édition de documents structurés / Type Transformation in Structured Document Editing Systems
Akpotsui, Extase 26 October 1993
Document editing systems founded on the logical description of document components rely on context-free grammars. These fairly rich grammars allow the description of document classes (structure schemas), of their components, of the hierarchical and neighborhood relations the components have with one another, and of semantic information attached to components in the form of attributes. Rigorous type-compatibility checking, beneficial in other respects, also has drawbacks, chief among them the rejection of cut-and-paste operations, the impossibility of editing documents whose structure schemas have evolved, and the impossibility of restructuring a document while editing it.

The aim of this thesis is to study the evolution of types, to propose solutions to these problems, and to implement them in the Grif system.

The first part of the thesis presents the state of the art and the problems of restructuring in structured document editing systems (SEDS) in general, and in particular in the Grif editor, which serves as the framework for this study.

The second part presents a typology of the evolution of structures and attributes, and a conceptual model for automatically converting the document instances affected by static evolution of structures.

The third part presents, in three points, a formalism of types for SEDS:

1. A mathematical model of types in SEDS, based on a functional representation of the structural characteristics of types, which allows the possible evolutions of structure to be expressed rigorously.

2. A set of definitions of structural relations between types (factor, subtyping, "massif", compatibility, equivalence), useful in static and dynamic transformations.

3. A grammatical approach to dynamic transformations: a structure schema can be transformed into a context-free grammar, and a document can be interpreted as a word of the language generated by that grammar. The language adopted in this thesis is built on a terminal alphabet composed of the set of the system's basic types, the set of identifiers of the system's structure schemas, and the set of symbols expressing the structure of types.
|
5 |
Finding Relevant PDF Medical Journal Articles by the Content of Their Figures as well as Their Text
Christiansen, Ammon J. 17 April 2007
This work addresses the need for an alternative to keyword-based search for sifting through large collections of PDF medical journal articles during literature review. Despite users' best efforts to form precise and accurate queries, it is often difficult to guess the keywords that find all the related articles while returning a minimum of unrelated ones. Failing to find relevant, related research during literature review wastes research time and effort and misses significant work in the area, which can affect the quality of the research being conducted. The purpose of this work is to explore the benefits of a retrieval system for professional journal articles in PDF format that supports hybrid queries composed of both text and images. PDF medical journal articles contain formatting and layout information that implies the structure and organization of the document. They also contain figures and tables rich with content and meaning. Stripping a PDF into “full-text” for indexing purposes disregards these important features. Specifically, this work investigated the following: (1) what effect incorporating a document's embedded figures into the query (in addition to its text) has on retrieval performance (precision) compared to plain keyword-based search; (2) how current text-based document-query similarity methods can be enhanced by using formatting and font-size information as a model of a PDF document's structure and organization; (3) whether to use the standard Euclidean distance function or the matrix distance function for content-based image retrieval; (4) how to convert a pure-layout PDF document into a structured, formatted, reflowable XML representation; (5) what document views (such as a term frequency cloud, a document outline, or a document's figures) would help users wade through search results and quickly select those worth a closer look. Although the experimental results were unexpectedly worse than their baselines (see the conclusion for a summary), the experimental methods are valuable in showing others which directions have already been pursued, why they did not work, and what problems remain to be solved in order to improve literature review through a hybrid text and image retrieval system.
|