Spelling suggestions: "subject:"forminformation retrieval"" "subject:"informationation retrieval""
551 |
Análise de métodos de produção de interfaces visuais para recuperação da informaçãoXavier, Raphael Figueiredo [UNESP] 29 September 2009 (has links) (PDF)
Made available in DSpace on 2014-06-11T19:26:43Z (GMT). No. of bitstreams: 0
Previous issue date: 2009-09-29Bitstream added on 2014-06-13T20:34:34Z : No. of bitstreams: 1
xavier_rf_me_mar.pdf: 1231843 bytes, checksum: d4c761feb071f93faee1532f9b12c4b3 (MD5) / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / O advento da Web e o conseqüente aumento no volume de informações eletrônicas acarretaram muitos problemas em relação ao acesso, busca, localização e recuperação de informação em grandes volumes de dados. O presente trabalho realiza uma revisão dos diferentes modelos, métodos e algoritmos existentes para a geração de Interfaces Visuais para Recuperação da Informação, classificados segundo ao seu processo de produção: Análise e Transformação dos Dados, Aplicação de Algoritmos de Classificação e Distribuição Visual e Aplicação de Técnicas de Transformação Visual. Os resultados pretendem servir a outros investigadores como ferramenta para a eleição de uma ou outra combinação metodológica no desenvolvimento de propostas específicas de Interfaces Visuais para Recuperação da Informação, além de sugerir a necessidade de maiores investigações sobre novas técnicas de transformação visual. / The advent of the Web and the consequent increase in the volume of electronic information had caused many problems about access, search, location and retrieval of information in large volumes of data. This work is a revision of the different models, methods and algorithms to create interfaces for Visual Information Retrieval, classified according to their production process: Analysis and Data Processing, Implementation of algorithms for classification and distribution of Visual and Application Processing Techniques of Visual. The results of other researchers want to serve as a tool for the election of one or another combination methodology in the development of specific proposals for visual interfaces for information retrieval, and suggest the need for more research into new techniques for processing visual.
|
552 |
Ontologia como interface de apresentação de resultados de busca: uma proposta baseada no modelo espaço vetorial / Ontology as an interface of presentation of search results: a proposal for a vector space modelLopes, Tatiane dos Santos de Freitas [UNESP] 31 August 2017 (has links)
Submitted by TATIANE DOS SANTOS DE FREITAS LOPES null (thaty_lopez@hotmail.com) on 2017-09-26T11:27:52Z
No. of bitstreams: 1
Dissertação - Tatiane Lopes.pdf: 3190468 bytes, checksum: 328cfdb2f66173743a997091b892cd2b (MD5) / Approved for entry into archive by Monique Sasaki (sayumi_sasaki@hotmail.com) on 2017-09-28T12:25:17Z (GMT) No. of bitstreams: 1
lopes_tsf_me_mar.pdf: 3190468 bytes, checksum: 328cfdb2f66173743a997091b892cd2b (MD5) / Made available in DSpace on 2017-09-28T12:25:17Z (GMT). No. of bitstreams: 1
lopes_tsf_me_mar.pdf: 3190468 bytes, checksum: 328cfdb2f66173743a997091b892cd2b (MD5)
Previous issue date: 2017-08-31 / Um sistema de recuperação de informação é um elemento mediador entre um acervo documental e os usuários que buscam por documentos relevantes. Nesse contexto, as interfaces desempenham uma função importante: em um primeiro momento, auxiliando o usuário na tarefa de expressar a sua necessidade de informação por meio de uma expressão de busca e, em um segundo momento, fornecendo recursos para ajudá-lo a selecionar documentos relevantes dentre os resultados obtidos. A recuperação de informação é um processo linguístico cuja eficiência depende de coincidências terminológicas entre a expressão de busca do usuário e a representação dos documentos. Este trabalho propõe um modelo de interface na qual a estrutura terminológica de uma ontologia é utilizada para auxiliar o usuário na seleção de documentos relevantes dentre aqueles resultantes de sua busca. Caracteriza-se como uma pesquisa de natureza aplicada, e exploratória e bibliográfica quanto aos procedimentos. Conclui-se que a apresentação visual de uma ontologia permite o desenvolvimento de interfaces dinâmicas e interativas, proporcionando ao usuário uma navegação estimulante e prazerosa por entre os documentos resultantes de sua busca, tendo por base os termos de uma determinada área de conhecimento. / An information retrieval system is a mediating element between a document collection and the users who looking for relevant documents. In this context, interfaces play an important role: firstly, assisting the user to expressing their information need by means of a search expression, and secondly by providing resources to help selecting relevant documents from the obtained results. The information retrieval is a linguistic process whose efficiency depends on terminological coincidences between the user’s query and the representation of documents. 
This work proposes an interface model in which the terminological structure of an ontology is used to assist the user in the selection of relevant documents among those resulting from their search. It is characterized as an applied, exploratory and bibliographic research. It is concluded that the visual presentation of ontology allows the development of dynamic and interactive interfaces, providing the user with stimulating and pleasant navigation among the documents resulting from their search, based on the terms of a certain knowledge area.
|
553 |
Extração de relações semanticas via análise de correlação de termos em documentos / Extracting semantic relations via analysis of correlated terms in documentsBotero, Sergio William 12 December 2008 (has links)
Orientador: Ivan Luiz Marques Ricarte / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de Computação / Made available in DSpace on 2018-08-12T17:41:25Z (GMT). No. of bitstreams: 1
Botero_SergioWilliam_M.pdf: 2163763 bytes, checksum: a7c5db625a3d99cead80cee63b7908ce (MD5)
Previous issue date: 2008 / Resumo: Sistemas de recuperação de informação são ferramentas para automatizar os procedimentos de busca por informações. Surgiram com propostas simples nas quais a recuperação era baseada exclusivamente na sintaxe das palavras e evoluíram para sistemas baseados na semântica das palavras como, por exemplo, os que utilizam ontologias. Entretanto, a especificação manual de ontologias é uma tarefa extremamente custosa e sujeita a erros humanos. Métodos automáticos para a construção de ontologias mostraram-se ineficientes, identificando falsas relações semânticas. O presente trabalho apresenta uma técnica baseada em processamento de linguagem natural e um novo algoritmo de agrupamento para a extração semi-automática de relações que utiliza o conteúdo dos documentos, uma ontologia de senso comum e supervisão do usuário para identificar corretamente as relações semânticas. A proposta envolve um estágio que utiliza recursos lingüísticos para a extração de termos e outro que utiliza algoritmos de agrupamento para a identificação de conceitos e relações semânticas de instanciação entre termos e conceitos. O algoritmo proposto é baseado em técnicas de agrupamento possibilístico e de bi-agrupamento e permite a extração interativa de conceitos e relações. Os resultados são promissores, similares às metodologias mais recentes, com a vantagem de permitir a supervisão do processo de extração / Abstract: Information Retrieval systems are tools to automate the searching for information. The first implementations were very simple, based exclusively on word syntax, and have evolved to systems that use semantic knowledge such as those using ontologies. However, the manual specification is an expensive task and subject to human mistakes. In order to deal with this problem, methodologies that automatically construct ontologies have been proposed but they did not reach good results, identifying false semantic relation between words. 
This work presents a natural language processing technique e a new clustering algorithm for the semi-automatic extraction of semantic relations by using the content of the document, a commom-sense ontology, and the supervision of the user to correctly identify semantic relations. The proposal encompasses a stage that uses linguistic resources to extract the terms and another stage that uses clustering algorithms to identify concepts and instanceof relations between terms and concepts. The proposed algorithm is based on possibilistic clustering and bi-clustering techniques and it allows the interative extraction of concepts. The results are promising, similar to the most recent methodologies, with the advantage of allowing the supervision of the extraction process / Mestrado / Engenharia de Computação / Mestre em Engenharia Elétrica
|
554 |
Recuperação de informação: análise sobre a contribuição da ciência da computação para a ciência da informação / Information Retrieval: analysis about the contribution of Computer Science to Information ScienceEdberto Ferneda 15 December 2003 (has links)
Desde o seu nascimento, a Ciência da Informação vem estudando métodos para o tratamento automático da informação. Esta pesquisa centrou-se na Recuperação de Informação, área que envolve a aplicação de métodos computacionais no tratamento e recuperação da informação, para avaliar em que medida a Ciência da Computação contribui para o avanço da Ciência da Informação. Inicialmente a Recuperação de Informação é contextualizada no corpo interdisciplinar da Ciência da Informação e são apresentados os elementos básicos do processo de recuperação de informação. Os modelos computacionais de recuperação de informação são analisados a partir da categorização em quantitativos e dinâmicos. Algumas técnicas de processamento da linguagem natural utilizadas na recuperação de informação são igualmente discutidas. No contexto atual da Web são apresentadas as técnicas de representação e recuperação da informação desde os mecanismos de busca até a Web Semântica. Conclui-se que, apesar da inquestionável importância dos métodos e técnicas computacionais no tratamento da informação, estas se configuram apenas como ferramentas auxiliares, pois utilizam uma conceituação de informação extremamente restrita em relação àquela utilizada pela Ciência da Informação / Since its birth, Information Science has been studying methods for the automatic treatment of information. This research has focused on Information Retrieval, an area that involves the application of computational methods in the treatment and retrieval of information, in order to assess how Computer Science contributes to the progress of Information Science. Initially, Information Retrieval is contextualized in the interdisciplinary body of Information Science and, after that, the basic elements of the information retrieval process are presented. Computational models related to information retrieval are analyzed according to "quantitative" and "dynamic" categories. 
Some natural language processing techniques used in information retrieval are equally discussed. In the current context of the Web, the techniques of information retrieval are presented, from search engines to the Semantic Web. It can be concluded that in spite of the unquestionable importance of the computational methods and techniques for dealing with information, they are regarded only as auxiliary tools, because their concept of "information" is extremely restrict in relation to that used by the Information Science.
|
555 |
Cross-view Embeddings for Information RetrievalGupta, Parth Alokkumar 03 March 2017 (has links)
In this dissertation, we deal with the cross-view tasks related to information retrieval
using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script
IR, which deals with the challenges faced by an IR system when a language is written
in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character
n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an
abstract space and CAE provides the state-of-the-art performance.
We study a wide variety of models for cross-language information retrieval (CLIR)
and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many
CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language
plagiarism detection. We empirically test the proposed models for these tasks on
publicly available datasets and present the results with analyses.
In this dissertation, we also explore an effective method to incorporate contextual
similarity for lexical selection in machine translation. Concretely, we investigate a
feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the
strong baselines for English-to-Spanish and English-to-Hindi translation tasks.
Finally, we explore the the methods to evaluate the quality of autoencoder generated representations of text data and analyse its architectural properties. For this,
we propose two metrics based on reconstruction capabilities of the autoencoders:
structure preservation index (SPI) and similarity accumulation index (SAI). We also
introduce a concept of critical bottleneck dimensionality (CBD) below which the
structural information is lost and present analyses linking CBD and language perplexity. / En esta disertación estudiamos problemas de vistas-múltiples relacionados con la recuperación de información utilizando técnicas de representación en espacios de baja dimensionalidad. Estudiamos las técnicas existentes y proponemos nuevas técnicas para solventar algunas de las limitaciones existentes. Presentamos formalmente el concepto de recuperación de información con escritura mixta, el cual trata las dificultades de los sistemas de recuperación de información cuando los textos contienen escrituras en distintos alfabetos debido a razones tecnológicas y socioculturales. Las palabras en escritura mixta son representadas en un espacio de características finito y reducido, compuesto por n-gramas de caracteres. Proponemos los auto-codificadores de vistas-múltiples (CAE, por sus siglas en inglés) para modelar dichas palabras en un espacio abstracto, y esta técnica produce resultados de vanguardia.
En este sentido, estudiamos varios modelos para la recuperación de información entre lenguas diferentes (CLIR, por sus siglas en inglés) y proponemos un modelo basado en redes neuronales composicionales (XCNN, por sus siglas en inglés), el cual supera las limitaciones de los métodos existentes. El método de XCNN propuesto produce mejores resultados en diferentes tareas de CLIR tales como la recuperación de información ad-hoc, la identificación de oraciones equivalentes en lenguas distintas y la detección de plagio entre lenguas diferentes. Para tal efecto, realizamos pruebas experimentales para dichas tareas sobre conjuntos de datos disponibles públicamente, presentando los resultados y análisis correspondientes.
En esta disertación, también exploramos un método eficiente para utilizar similitud semántica de contextos en el proceso de selección léxica en traducción automática. Específicamente, proponemos características extraídas de los contextos disponibles en las oraciones fuentes mediante el uso de auto-codificadores. El uso de las características propuestas demuestra mejoras estadísticamente significativas sobre sistemas de traducción robustos para las tareas de traducción entre inglés y español, e inglés e hindú.
Finalmente, exploramos métodos para evaluar la calidad de las representaciones de datos de texto generadas por los auto-codificadores, a la vez que analizamos las propiedades de sus arquitecturas. Como resultado, proponemos dos nuevas métricas para cuantificar la calidad de las reconstrucciones generadas por los auto-codificadores: el índice de preservación de estructura (SPI, por sus siglas en inglés) y el índice de acumulación de similitud (SAI, por sus siglas en inglés). También presentamos el concepto de dimensión crítica de cuello de botella (CBD, por sus siglas en inglés), por debajo de la cual la información estructural se deteriora. Mostramos que, interesantemente, la CBD está relacionada con la perplejidad de la lengua. / En aquesta dissertació estudiem els problemes de vistes-múltiples relacionats amb la recuperació d'informació utilitzant tècniques de representació en espais de baixa dimensionalitat. Estudiem les tècniques existents i en proposem unes de noves per solucionar algunes de les limitacions existents. Presentem formalment el concepte de recuperació d'informació amb escriptura mixta, el qual tracta les dificultats dels sistemes de recuperació d'informació quan els textos contenen escriptures en diferents alfabets per motius tecnològics i socioculturals. Les paraules en escriptura mixta són representades en un espai de característiques finit i reduït, composat per n-grames de caràcters. Proposem els auto-codificadors de vistes-múltiples (CAE, per les seves sigles en anglès) per modelar aquestes paraules en un espai abstracte, i aquesta tècnica produeix resultats d'avantguarda.
En aquest sentit, estudiem diversos models per a la recuperació d'informació entre llengües diferents (CLIR , per les sevas sigles en anglès) i proposem un model basat en xarxes neuronals composicionals (XCNN, per les sevas sigles en anglès), el qual supera les limitacions dels mètodes existents. El mètode de XCNN proposat produeix millors resultats en diferents tasques de CLIR com ara la recuperació d'informació ad-hoc, la identificació d'oracions equivalents en llengües diferents, i la detecció de plagi entre llengües diferents. Per a tal efecte, realitzem proves experimentals per aquestes tasques sobre conjunts de dades disponibles públicament, presentant els resultats i anàlisis corresponents.
En aquesta dissertació, també explorem un mètode eficient per utilitzar similitud semàntica de contextos en el procés de selecció lèxica en traducció automàtica. Específicament, proposem característiques extretes dels contextos disponibles a les oracions fonts mitjançant l'ús d'auto-codificadors. L'ús de les característiques proposades demostra millores estadísticament significatives sobre sistemes de traducció robustos per a les tasques de traducció entre anglès i espanyol, i anglès i hindú.
Finalment, explorem mètodes per avaluar la qualitat de les representacions de dades de text generades pels auto-codificadors, alhora que analitzem les propietats de les seves arquitectures. Com a resultat, proposem dues noves mètriques per quantificar la qualitat de les reconstruccions generades pels auto-codificadors: l'índex de preservació d'estructura (SCI, per les seves sigles en anglès) i l'índex d'acumulació de similitud (SAI, per les seves sigles en anglès). També presentem el concepte de dimensió crítica de coll d'ampolla (CBD, per les seves sigles en anglès), per sota de la qual la informació estructural es deteriora. Mostrem que, de manera interessant, la CBD està relacionada amb la perplexitat de la llengua. / Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457
|
556 |
Information Retrieval in der Lehre / Teaching Information Retrieval - Supporting the Acquisition of Practical Knowledge About Information Retrieval Components Using Real-World Experiments and Game MechanicsWilhelm-Stein, Thomas 26 May 2016 (has links) (PDF)
Das Thema Information Retrieval hat insbesondere in Form von Internetsuchmaschinen eine große Bedeutung erlangt. Retrievalsysteme werden für eine Vielzahl unterschiedlicher Rechercheszenarien eingesetzt, unter anderem für firmeninterne Supportdatenbanken, aber auch für die Organisation persönlicher E-Mails.
Eine aktuelle Herausforderung besteht in der Bestimmung und Vorhersage der Leistungsfähigkeit einzelner Komponenten dieser Retrievalsysteme, insbesondere der komplexen Wechselwirkungen zwischen ihnen. Für die Implementierung und Konfiguration der Retrievalsysteme und der Retrievalkomponenten werden Fachleute benötigt. Mithilfe der webbasierten Lernanwendung Xtrieval Web Lab können Studierende praktisches Wissen über den Information Retrieval Prozess erwerben, indem sie Retrievalkomponenten zu einem Retrievalsystem zusammenstellen und evaluieren, ohne dafür eine Programmiersprache einsetzen zu müssen. Spielemechaniken leiten die Studierenden bei ihrem Entdeckungsprozess an, motivieren sie und verhindern eine Informationsüberladung durch eine Aufteilung der Lerninhalte. / Information retrieval has achieved great significance in form of search engines for the Internet. Retrieval systems are used in a variety of research scenarios, including corporate support databases, but also for the organization of personal emails.
A current challenge is to determine and predict the performance of individual components of these retrieval systems, in particular the complex interactions between them. For the implementation and configuration of retrieval systems and retrieval components professionals are needed. By using the web-based learning application Xtrieval Web Lab students can gain practical knowledge about the information retrieval process by arranging retrieval components in a retrieval system and their evaluation without using a programming language. Game mechanics guide the students in their discovery process, motivate them and prevent information overload by a partition of the learning content.
|
557 |
Information Retrieval in der Lehre: Unterstützung des Erwerbs von Praxiswissen zu Information Retrieval Komponenten mittels realer Experimente und SpielemechanikenWilhelm-Stein, Thomas 26 May 2016 (has links)
Das Thema Information Retrieval hat insbesondere in Form von Internetsuchmaschinen eine große Bedeutung erlangt. Retrievalsysteme werden für eine Vielzahl unterschiedlicher Rechercheszenarien eingesetzt, unter anderem für firmeninterne Supportdatenbanken, aber auch für die Organisation persönlicher E-Mails.
Eine aktuelle Herausforderung besteht in der Bestimmung und Vorhersage der Leistungsfähigkeit einzelner Komponenten dieser Retrievalsysteme, insbesondere der komplexen Wechselwirkungen zwischen ihnen. Für die Implementierung und Konfiguration der Retrievalsysteme und der Retrievalkomponenten werden Fachleute benötigt. Mithilfe der webbasierten Lernanwendung Xtrieval Web Lab können Studierende praktisches Wissen über den Information Retrieval Prozess erwerben, indem sie Retrievalkomponenten zu einem Retrievalsystem zusammenstellen und evaluieren, ohne dafür eine Programmiersprache einsetzen zu müssen. Spielemechaniken leiten die Studierenden bei ihrem Entdeckungsprozess an, motivieren sie und verhindern eine Informationsüberladung durch eine Aufteilung der Lerninhalte. / Information retrieval has achieved great significance in form of search engines for the Internet. Retrieval systems are used in a variety of research scenarios, including corporate support databases, but also for the organization of personal emails.
A current challenge is to determine and predict the performance of individual components of these retrieval systems, in particular the complex interactions between them. For the implementation and configuration of retrieval systems and retrieval components professionals are needed. By using the web-based learning application Xtrieval Web Lab students can gain practical knowledge about the information retrieval process by arranging retrieval components in a retrieval system and their evaluation without using a programming language. Game mechanics guide the students in their discovery process, motivate them and prevent information overload by a partition of the learning content.
|
558 |
Collecte orientée sur le Web pour la recherche d’information spécialisée / Focused document gathering on the Web for domain-specific information retrievalDe Groc, Clément 05 June 2013 (has links)
Les moteurs de recherche verticaux, qui se concentrent sur des segments spécifiques du Web, deviennent aujourd'hui de plus en plus présents dans le paysage d'Internet. Les moteurs de recherche thématiques, notamment, peuvent obtenir de très bonnes performances en limitant le corpus indexé à un thème connu. Les ambiguïtés de la langue sont alors d'autant plus contrôlables que le domaine est bien ciblé. De plus, la connaissance des objets et de leurs propriétés rend possible le développement de techniques d'analyse spécifiques afin d'extraire des informations pertinentes.Dans le cadre de cette thèse, nous nous intéressons plus précisément à la procédure de collecte de documents thématiques à partir du Web pour alimenter un moteur de recherche thématique. La procédure de collecte peut être réalisée en s'appuyant sur un moteur de recherche généraliste existant (recherche orientée) ou en parcourant les hyperliens entre les pages Web (exploration orientée).Nous étudions tout d'abord la recherche orientée. Dans ce contexte, l'approche classique consiste à combiner des mot-clés du domaine d'intérêt, à les soumettre à un moteur de recherche et à télécharger les meilleurs résultats retournés par ce dernier.Après avoir évalué empiriquement cette approche sur 340 thèmes issus de l'OpenDirectory, nous proposons de l'améliorer en deux points. En amont du moteur de recherche, nous proposons de formuler des requêtes thématiques plus pertinentes pour le thème afin d'augmenter la précision de la collecte. Nous définissons une métrique fondée sur un graphe de cooccurrences et un algorithme de marche aléatoire, dans le but de prédire la pertinence d'une requête thématique. En aval du moteur de recherche, nous proposons de filtrer les documents téléchargés afin d'améliorer la qualité du corpus produit. 
Pour ce faire, nous modélisons la procédure de collecte sous la forme d'un graphe triparti et appliquons un algorithme de marche aléatoire biaisé afin d'ordonner par pertinence les documents et termes apparaissant dans ces derniers.Dans la seconde partie de cette thèse, nous nous focalisons sur l'exploration orientée du Web. Au coeur de tout robot d'exploration orientée se trouve une stratégie de crawl qui lui permet de maximiser le rapatriement de pages pertinentes pour un thème, tout en minimisant le nombre de pages visitées qui ne sont pas en rapport avec le thème. En pratique, cette stratégie définit l'ordre de visite des pages. Nous proposons d'apprendre automatiquement une fonction d'ordonnancement indépendante du thème à partir de données existantes annotées automatiquement. / Vertical search engines, which focus on a specific segment of the Web, become more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index on a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs.In this thesis, we tackle the first inevitable step of a all topical search engine : focused document gathering from the Web. A thorough study of the state of art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling).The first part of our research has been dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and down- loading top ranked documents. 
After empirically evaluating this approach over 340 topics, we propose to enhance it in two different ways: Upstream of the search engine, we aim at formulating more relevant queries in or- der to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm, which aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the document collection quality. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously order by relevance the documents and terms appearing in our corpus.In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation that was designed to scale horizontally. Then, we consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. Such ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the non relevant ones. We propose to apply learning to rank algorithms to efficiently order the crawl frontier, and define a method to learn a ranking function from existing crawls.
|
559 |
ValidAX - Validierung der Frameworks AMOPA und XTRIEVAL: Vorhaben im Rahmen des Programms Validierung des Innovationspotenzials wissenschaftlicher Forschung - VIP : SchlussberichtBerger, Arne, Eibl, Maximilian, Heinich, Stephan, Herms, Robert, Kahl, Stefan, Kürsten, Jens, Kurze, Albrecht, Manthey, Robert, Rickert, Markus, Ritter, Marc January 2015 (has links)
Das Projekt „ValidAX - Validierung der Frameworks AMOPA und XTRIEVAL“ untersucht die Möglichkeiten die an der Professur Medieninformatik der TU Chemnitz erstellten Softwareframeworks AMOPA (Automated Moving Picture Annotator) und Xtrieval (Extensible Information Retrieval Framework) in Richtung einer wirtschaftlichen Verwertbarkeit weiterzuentwickeln und in Arbeitsprozesse praktisch einzubinden. AMOPA ist in der Lage, beliebige audiovisuelle Medien zu analysieren und Metadaten wie Schnittgrenzen, Szenen, Personen, Audiotranskriptionen und andere durchzuführen. Xtrieval ist ein hochflexibles Werkzeug, welches die Recherche in beliebigen Medien optimal ermöglicht. Für die Durchführung des Projekts wurden insgesamt drei mögliche Einsatzszenarien definiert, in denen die Frameworks unterschiedlichen Anforderungen ausgesetzt waren:
- Archivierung
- Interaktives und automatisiertes Fernsehen
- Medizinische Videoanalysen
Entsprechend der Szenarien wurden die Frameworks optimiert und technische Workflows konzipiert und realisiert. Demonstratoren dienen zur Gewinnung weiterer Verwertungspartner.:I. Kurzdarstellung 2
1. Aufgabenstellung 2
2. Voraussetzungen, unter denen das Vorhaben durchgeführt wurde 3
3. Planung und Ablauf des Vorhabens 4
4. Wissenschaftlicher und technischer Stand, an den angeknüpft wurde 6
5. Zusammenarbeit mit anderen Stellen 7
II. Eingehende Darstellung 7
1. Verwendung der Zuwendung und des erzielten Ergebnisses im Einzelnen 7
AB 1: Flexible Mediatranskodierung für den Transport audiovisueller Medien 7
AB 2: Archivierungsstraße 13
AB 3: Workflowintegration 18
AB 4: Annotationsunterstützung 27
AB 5: Bilderkennung 33
AB 6: Web-Services 40
AB 7: Parallelverarbeitung 41
2. Wichtigste Positionen des zahlenmäßigen Nachweises 43
3. Notwendigkeit und Angemessenheit der geleisteten Arbeit 45
4. Voraussichtlicher Nutzen, insbesondere der Verwertbarkeit des Ergebnisses im Sinne des fortgeschriebenen Verwertungsplans 46
5. Während der Durchführung des Vorhabens dem ZE bekannt gewordenen Fortschritts auf dem Gebiet des Vorhabens bei anderen Stellen 48
6. Erfolgte oder geplante Veröffentlichungen der Ergebnisse 48 / The project "ValidAX - Validation of the frameworks AMOPA and XTRIEVAL" examines the possibilities of developing the software framework AMOPA (Automated Moving Picture Annotator) and Xtrieval (Extensible Information Retrieval Framework) towards a commercial usage. The frameworks have been created by the Chair Media Informatics at the TU Chemnitz. AMOPA is able to analyze any audiovisual media and to generate additional metadata such as scene detection, face detection, audio transcriptions and others. Xtrieval is a highly flexible tool that allows users to search in any media. For the implementation of the project a total of three possible scenarios have been defined, in which the frameworks were exposed to different requirements:
• Archiving
• Interactive and automated TV
• Medical video analysis
According to these scenarios, the frameworks were optimized, and technical workflows were designed and implemented. Demonstrators are used to attract further commercialization partners.
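The kind of time-coded metadata generation described above, such as AMOPA's scene detection, can be illustrated with a minimal sketch. The histogram-difference approach and all names below are illustrative assumptions, not AMOPA's actual implementation:

```python
# Hypothetical sketch of one analysis step an annotator like AMOPA might
# chain: detect scene cuts by comparing grey-value histograms of
# consecutive frames and flagging large jumps.

def histogram(frame, bins=8):
    """Normalized grey-value histogram of a flat list of pixels (0..255)."""
    hist = [0] * bins
    for px in frame:
        hist[px * bins // 256] += 1
    total = len(frame)
    return [h / total for h in hist]

def scene_cuts(frames, threshold=0.5):
    """Return frame indices where the histogram distance to the previous
    frame exceeds the threshold, i.e. likely scene boundaries."""
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            dist = sum(abs(a - b) for a, b in zip(prev, hist))
            if dist > threshold:
                cuts.append(i)
        prev = hist
    return cuts

# Toy 8x8 "frames": two dark frames followed by two bright ones.
dark = [10] * 64
bright = [240] * 64
print(scene_cuts([dark, dark, bright, bright]))  # [2]
```

A real pipeline would decode video frames and emit the cut positions as time-coded annotations alongside other detectors (faces, speech transcripts).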
|
560 |
Graphdatenbanken für die textorientierten e-Humanities
Efer, Thomas 15 February 2017 (has links) (PDF)
Vor dem Hintergrund zahlreicher Digitalisierungsinitiativen befinden sich weite Teile der Geistes- und Sozialwissenschaften derzeit in einer Transition hin zur großflächigen Anwendung digitaler Methoden. Zwischen den Fachdisziplinen und der Informatik zeigen sich große Differenzen in der Methodik und bei der gemeinsamen Kommunikation. Diese durch interdisziplinäre Projektarbeit zu überbrücken, ist das zentrale Anliegen der sogenannten e-Humanities. Da Text der häufigste Untersuchungsgegenstand in diesem Feld ist, wurden bereits viele Verfahren des Text Mining auf Problemstellungen der Fächer angepasst und angewendet. Während sich langsam generelle Arbeitsabläufe und Best Practices etablieren, zeigt sich, dass generische Lösungen für spezifische Teilprobleme oftmals nicht geeignet sind. Um für diese Anwendungsfälle maßgeschneiderte digitale Werkzeuge erstellen zu können, ist eines der Kernprobleme die adäquate digitale Repräsentation von Text sowie seinen vielen Kontexten und Bezügen.
In dieser Arbeit wird eine neue Form der Textrepräsentation vorgestellt, die auf Property-Graph-Datenbanken beruht – einer aktuellen Technologie für die Speicherung und Abfrage hochverknüpfter Daten. Darauf aufbauend wird das Textrecherchesystem „Kadmos“ vorgestellt, mit welchem nutzerdefinierte asynchrone Webservices erstellt werden können. Es bietet flexible Möglichkeiten zur Erweiterung des Datenmodells und der Programmfunktionalität und kann Textsammlungen mit mehreren hundert Millionen Wörtern auf einzelnen Rechnern und weitaus größere in Rechnerclustern speichern. Es wird gezeigt, wie verschiedene Text-Mining-Verfahren über diese Graphrepräsentation realisiert und an sie angepasst werden können. Die feine Granularität der Zugriffsebene erlaubt die Erstellung passender Werkzeuge für spezifische fachwissenschaftliche Anwendungen. Zusätzlich wird demonstriert, wie die graphbasierte Modellierung auch über die rein textorientierte Forschung hinaus gewinnbringend eingesetzt werden kann. / In light of recent massive digitization efforts, most humanities disciplines are currently undergoing a fundamental transition towards the widespread application of digital methods. Between these traditional scholarly fields and computer science lies a methodological and communication gap that the so-called "e-Humanities" aim to bridge systematically via interdisciplinary project work. With text being the most common object of study in this field, many approaches from the area of Text Mining have been adapted to problems of the disciplines. While common workflows and best practices are slowly emerging, it is evident that generic solutions are often a poor fit for specific application scenarios. To be able to create custom-tailored digital tools, one of the central issues is to represent the text digitally, together with its many contexts and related objects of interest, in an adequate manner.
This thesis introduces a novel form of text representation based on Property Graph databases – an emerging technology used to store and query highly interconnected data sets. Based on this modeling paradigm, a new text research system called "Kadmos" is introduced. It provides user-definable asynchronous web services and is built to allow for flexible extension of the data model and system functionality within a prototype-driven development process. With Kadmos it is possible to scale to text collections containing hundreds of millions of words on a single machine, and even further on a machine cluster. It is shown how various methods of Text Mining can be implemented with, and adapted for, the graph representation at a very fine level of granularity, allowing the creation of fitting digital tools for different aspects of scholarly work. Extended usage scenarios demonstrate how graph-based modeling of domain data can be beneficial even in research that goes beyond purely text-based study.
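The graph-based text representation described in the abstract can be illustrated with a minimal sketch, assuming a simplified property-graph model in which token nodes are linked by NEXT edges; the names and structure here are illustrative assumptions, not the actual Kadmos data model:

```python
from collections import defaultdict

# Minimal property-graph sketch (not the Kadmos schema): each token of a
# document becomes a node with properties, and consecutive tokens are
# linked by NEXT edges, so n-gram queries become short path traversals.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}                 # node_id -> property dict
        self.edges = defaultdict(list)  # node_id -> [(label, target_id)]
        self._next_id = 0

    def add_node(self, **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = props
        return node_id

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

def ingest(graph, doc_id, text):
    """Store one document as a chain of token nodes linked by NEXT edges."""
    prev = None
    for pos, token in enumerate(text.split()):
        node = graph.add_node(doc=doc_id, pos=pos, form=token.lower())
        if prev is not None:
            graph.add_edge(prev, "NEXT", node)
        prev = node

def bigrams(graph):
    """Traverse NEXT edges to count token bigrams across the corpus."""
    counts = defaultdict(int)
    for src, out in graph.edges.items():
        for label, dst in out:
            if label == "NEXT":
                counts[(graph.nodes[src]["form"], graph.nodes[dst]["form"])] += 1
    return counts

g = PropertyGraph()
ingest(g, "d1", "the quick brown fox")
ingest(g, "d2", "the quick red fox")
print(bigrams(g)[("the", "quick")])  # 2
```

In a real property-graph database the same idea is expressed declaratively, e.g. as a two-node path pattern over NEXT relationships, and further node and edge types (sentences, annotations, metadata) can be added without changing the stored text.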
|