41 |
Automatische Sacherschließung an der ZBW / Automatic subject indexing at the ZBW. Groß, Thomas 06 January 2012 (PDF)
With the implementation of an automatic subject indexing procedure, the ZBW aims both to respond to the steady growth in online documents and to break new ground in content indexing. Beyond relieving intellectual (manual) indexing through a semi-automatic or fully automatic procedure, it should also become possible to index digital information resources of any kind from outside the ZBW by machine and to make them findable in a shared search space. In the current project, the vocabularies used at the ZBW (verbal subject indexing with the Standard-Thesaurus Wirtschaft, and classificatory indexing with the Standardklassifikation Wirtschaft) are being adapted, trained, and evaluated for the automatic procedure. The talk focuses on the ZBW's experience with the organizational implementation of automatic subject indexing and on the options for evaluating these procedures.
|
42 |
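Dos sintagmas nominais aos descritores documentais: estudo de caso na indexação de teses e dissertações da área de Direito / From noun phrases to documentary descriptors: a case study in the indexing of theses and dissertations in the field of Law. NASCIMENTO, Gustavo Diniz 20 November 2015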
The use of noun phrases as tools for organizing and retrieving digital information has proven to be a promising alternative for information systems. Automatic indexing by noun phrases reduces some of the problems of indexing based on isolated words, because noun phrases are syntactic units that carry a specific meaning. However, not every noun phrase found in a digital document is representative of it, so the noun phrases that can actually serve as documentary descriptors have to be selected. The general objective of this work is to investigate the selection of noun phrases with descriptor value in the automatic indexing, by noun phrases, of abstracts of Portuguese-language theses and dissertations in the field of law.
The specific objectives are: 1. To investigate the automatic indexing process by noun phrases; 2. To determine the characteristics of a noun phrase that has value as a documentary descriptor; 3. To identify, in the national scientific literature, methodologies for selecting noun phrases in Portuguese texts, along with the selection criteria of each methodology; 4. To design an experiment in which the extracted noun phrases are categorized by whether they meet the selection criteria proposed in the literature and by their value as descriptors, that is, whether they resemble the documentary descriptors produced by manual indexing; 5. To evaluate the selection criteria in automatic indexing by noun phrases for theses and dissertations in the legal field.
To achieve these objectives, the study combined bibliographic research with an experiment. The bibliographic research identified studies on automatic indexing by noun phrases, especially on the selection of noun phrases that work as documentary descriptors, and from these studies several selection criteria were identified. The experiment applied the criteria found in the literature to noun phrases extracted from a set of abstracts of theses and dissertations in the legal field, in order to measure how useful those criteria are for selecting descriptor noun phrases. It involved manual indexing of the documents, automatic extraction of their noun phrases, categorization of the noun phrases as descriptors based on their similarity to the descriptors from manual indexing, and application of the selection criteria to the extracted noun phrases. The experiment showed that the criteria behave differently from one another, and most of them were considered useful for selecting noun phrases.
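The evaluation step of such an experiment can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the selection criteria, so the frequency threshold below is hypothetical, the example phrases and counts are invented, and similarity to the manual descriptors is simplified to a case-insensitive exact match.

```python
def evaluate_criterion(phrase_counts, manual_descriptors, criterion):
    """Return (precision, recall) of a selection criterion against manual indexing."""
    manual = {d.lower() for d in manual_descriptors}
    # Noun phrases the criterion would keep as descriptor candidates.
    selected = {p for p, n in phrase_counts.items() if criterion(p, n)}
    # Selected phrases that also appear among the manual descriptors.
    hits = {p for p in selected if p.lower() in manual}
    # All extracted phrases that the manual indexing treats as descriptors.
    relevant = {p for p in phrase_counts if p.lower() in manual}
    precision = len(hits) / len(selected) if selected else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example data: extracted noun phrases with their frequencies.
phrase_counts = {
    "direito penal": 3,
    "presente trabalho": 2,
    "responsabilidade civil": 2,
    "objetivo geral": 1,
    "dano moral": 1,
}
manual_descriptors = ["Direito penal", "Responsabilidade civil", "Dano moral"]


def min_frequency(phrase, count):
    """Hypothetical criterion: keep phrases that occur at least twice."""
    return count >= 2


precision, recall = evaluate_criterion(phrase_counts, manual_descriptors, min_frequency)
```

Running the same function once per criterion gives comparable precision and recall figures, which is one simple way to judge whether a criterion is useful for selecting descriptors.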
|
43 |
Lingo – ein System zur automatischen Indexierung – Anwendung und Einsatzmöglichkeiten / Lingo, a system for automatic indexing: application and possible uses. Müller, Thomas 26 January 2011
The heterogeneous museum holdings (text, images, physical objects) of the Haus der Geschichte der Bundesrepublik Deutschland currently comprise more than 365,000 object descriptions of contemporary-history objects. Building on the open-source indexing system lingo and on the existing framework conditions, an automatic indexing procedure is being developed that generates standardized descriptive features and makes them available as index terms for retrieval. The goal is a unified search across the object descriptions, achieved through linguistic and semantic harmonization of the index terms.
|
44 |
Sjednocování věcného popisu agregovaných záznamů v repozitáři NUŠL / Unification of Subject Description of Aggregated Records in National Repository of Grey Literature. Charvátová, Michaela January 2016
The diploma thesis focuses on methods for unifying the subject description of records aggregated from different sources in digital repositories, using the National Repository of Grey Literature (NRGL) as an example. After presenting experience from abroad with the BASE and LASSO systems, I describe the current situation in NRGL, where automatic indexing is used to assign each record a unified subject heading from the Polythematic Structured Subject Heading System (PSSHS). The thesis then presents how the MeSH thesaurus and the Conspectus categorization scheme were mapped to PSSHS. These mappings were then applied to records from the National Medical Library. The aim of the experiment was to compare the subject description consisting of PSSHS subject headings created by automatic indexing with the subject description created by mapping. In addition, I explore the possibilities of mapping author keywords in records of academic theses.
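The comparison between a mapped and an automatically indexed subject description can be pictured with a small sketch. It rests on assumptions: the mapping entries and target headings below are invented placeholders (the actual MeSH-to-PSSHS mapping is not reproduced in the abstract), and Jaccard overlap is just one possible way to compare the two heading sets.

```python
# Hypothetical mapping table from source terms to target subject headings.
MESH_TO_TARGET = {
    "Neoplasms": "oncology",
    "Nursing Care": "nursing",
    "Public Health": "public health",
}


def map_headings(source_terms, mapping=MESH_TO_TARGET):
    """Return the target headings for every source term that has a mapping."""
    return {mapping[t] for t in source_terms if t in mapping}


def jaccard(a, b):
    """Overlap of two heading sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented example record with source terms and automatically assigned headings.
record = {
    "mesh": ["Neoplasms", "Public Health", "Unmapped Term"],
    "auto_indexed": ["oncology", "economics"],
}

mapped = map_headings(record["mesh"])
overlap = jaccard(mapped, record["auto_indexed"])
```

Terms without a mapping are silently skipped here; a real workflow would probably log them so the mapping table can be extended.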
|
45 |
Information spotting in huge repositories of scanned document images / Localisation d'information dans des très grands corpus de documents numérisés. Dang, Quoc Bao 06 April 2018
This work aims to develop a generic framework for building camera-based (webcam, smartphone) information spotting applications over very large repositories of heterogeneous scanned document images, using local descriptors. The targeted systems take a camera-captured portion of an image as a query and return the portion of the database image that best matches it.
We first propose a set of generic feature descriptors for camera-based document image retrieval and spotting: SRIF, PSRIF, DELTRIF and SSKSRIF. They are built from the spatial organization of the nearest keypoints around a pivot keypoint, where the keypoints are the centroids of the image's connected components. From these keypoints, geometric features that are invariant to degradations are used to build the descriptors. SRIF and PSRIF are computed from a local set of the m nearest keypoints around a pivot keypoint. For DELTRIF and SSKSRIF, the spatial organization comes from a Delaunay triangulation of the keypoints extracted from the document image, which gives a parameter-free local shape description. We also propose a framework for computing the descriptors from the spatial layout of dedicated keypoints (e.g. SURF, SIFT or ORB), so that they can handle retrieval and spotting in camera-captured document images with heterogeneous content.
The second contribution concerns indexing. In a large-scale system, storing an enormous number of high-dimensional descriptors weighs heavily on memory, and high dimensionality can also reduce indexing accuracy. We propose three robust indexing frameworks that work without storing local descriptors in memory, which saves memory and speeds up retrieval by skipping distance validation. The first, randomized clustering tree indexing, inherits properties of the kd-tree, the k-means tree and random forests: at each tree node, K randomly selected dimensions are combined with the highest-variance dimension. We also propose a weighted Euclidean distance between two data points that is oriented toward the highest-variance dimension. The second framework relies on a single simple hash table for indexing and retrieval, without storing database descriptors. The third is an extended hashing-based method for indexing multiple kinds of features coming from multiple layers of the image; it applies both to a uniform feature type and to multiple feature types from separate layers.
As a third contribution, we propose a simple and robust way to compute the shape orientation of MSER regions, so that the MSER detector can be combined with dedicated descriptors (e.g. SIFT, SURF, ORB) in a rotation-invariant way. Because most of these descriptors capture neighborhood information around a region, we also propose extending MSER regions by increasing the radius of each region. This strategy can be applied to other detected regions as well, to make descriptors more distinctive.
Finally, because no public dataset existed for camera-based document image retrieval and spotting, we built a new dataset to evaluate our contributions and made it freely and publicly available to the scientific community. It contains portions of document images captured with a camera as queries and covers three kinds of content: textual, graphical and heterogeneous.
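The idea of indexing with a single hash table, without keeping the database descriptors or validating distances, can be illustrated with a minimal sketch. It is a generic illustration and not the thesis's actual scheme: the quantization step, the `VotingIndex` class and the example descriptors are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def quantize(descriptor, step=0.25):
    """Turn a real-valued descriptor into a hashable key by coarse quantization."""
    return tuple(int(round(v / step)) for v in descriptor)


class VotingIndex:
    """Hash table from quantized descriptor keys to document ids.

    Only keys and ids are stored, never the descriptors themselves,
    so retrieval is a table lookup followed by vote counting.
    """

    def __init__(self, step=0.25):
        self.step = step
        self.table = defaultdict(list)

    def add(self, doc_id, descriptors):
        for d in descriptors:
            self.table[quantize(d, self.step)].append(doc_id)

    def query(self, descriptors):
        votes = Counter()
        for d in descriptors:
            key = quantize(d, self.step)
            if key in self.table:  # membership test avoids creating empty buckets
                votes.update(self.table[key])
        return votes.most_common()


# Invented example: two indexed documents and a slightly perturbed query.
index = VotingIndex()
index.add("A", [(0.10, 0.52, 0.98), (0.30, 0.31, 0.76)])
index.add("B", [(0.90, 0.10, 0.40)])
results = index.query([(0.11, 0.50, 1.01), (0.29, 0.33, 0.74)])
```

Coarse quantization is what lets a slightly degraded query descriptor land in the same bucket as its database counterpart; the trade-off is that values near a bucket boundary can fall into a neighboring bucket and miss.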
|
46 |
Análise de redes sociais em dados bibliográficos / Social network analysis on bibliographical data. Pacheco, Urubatan Rocha 11 August 2010
Advisor: Ricardo de Oliveira Anido / Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: This work enables structural analysis of scientific collaboration social networks built from bibliographic databases. The bibliographic data are used to derive the social networks of authors' affiliation with scientific research institutions, and the relations between publications and research-area ontologies are extracted from the publications. We studied and applied methods that use social network analysis to resolve or reduce ambiguity in the identities of researchers, and applied the same methods to the names of institutions, scientific venues, countries, states, etc. Another subject of study was how to measure the quality of the results and the problems that affect it. Finally, we developed metrics and implemented tools that allow the scientific production of institutions, departments, research groups, research areas, countries, etc. to be compared. The tools also produced a ranking of universities based on the prestige of their researchers in the co-authorship social network. Correlating this ranking with others that assess the quality of universities' scientific production using similar criteria showed that the structural prestige information was properly captured. / Master's / Metodologia e Tecnicas da Computação (Computing Methodology and Techniques) / Master in Computer Science
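A prestige-based ranking of this kind can be sketched in a few lines. The abstract does not say which prestige measure was used, so PageRank on a weighted co-authorship graph is an assumption here, and the author names, affiliations and publication lists are invented.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(publications):
    """Build an undirected graph whose edge weights count shared publications."""
    graph = defaultdict(lambda: defaultdict(float))
    for authors in publications:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node has at least one co-author by construction."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            for m, w in graph[n].items():
                new[m] += damping * rank[n] * w / total
        rank = new
    return rank


def rank_institutions(rank, affiliation):
    """Sum author prestige per institution and sort in descending order."""
    totals = defaultdict(float)
    for author, score in rank.items():
        totals[affiliation[author]] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented example data.
publications = [["Ana", "Bruno"], ["Ana", "Carla"], ["Ana", "Bruno", "Davi"]]
affiliation = {"Ana": "U1", "Bruno": "U1", "Carla": "U2", "Davi": "U2"}

rank = pagerank(coauthor_graph(publications))
ranking = rank_institutions(rank, affiliation)
```

Authors who only appear on single-author publications never enter the graph in this sketch, so they contribute nothing to their institution's score; whether that is desirable depends on how prestige is meant to be defined.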
|
47 |
Automatische Sacherschließung an der ZBW: Status quo & Ausblick / Automatic subject indexing at the ZBW: status quo and outlook. Groß, Thomas 06 January 2012
With the implementation of an automatic subject indexing procedure, the ZBW aims both to respond to the steady growth in online documents and to break new ground in content indexing. Beyond relieving intellectual (manual) indexing through a semi-automatic or fully automatic procedure, it should also become possible to index digital information resources of any kind from outside the ZBW by machine and to make them findable in a shared search space. In the current project, the vocabularies used at the ZBW (verbal subject indexing with the Standard-Thesaurus Wirtschaft, and classificatory indexing with the Standardklassifikation Wirtschaft) are being adapted, trained, and evaluated for the automatic procedure. The talk focuses on the ZBW's experience with the organizational implementation of automatic subject indexing and on the options for evaluating these procedures.
|