Global ETD Search

361	Semiautomatische Metadaten-Extraktion und Qualitätsmanagement in Workflow-Systemen zur Digitalisierung historischer Dokumente / Semi-automated Metadata Extraction and Quality Management in Workflow Systems for Digitizations of Early Documents Schöneberg, Hendrik January 2014 Performing Named Entity Recognition on ancient documents is a time-consuming, complex and error-prone manual task. It is, however, a prerequisite for identifying related documents and correlating named entities across distinct sources, which helps to recreate historic events precisely. In order to reduce the manual effort, automated classification approaches could be leveraged. Classifying terms in ancient documents in an automated manner is difficult because of the sources' challenging syntax and poor state of conservation. This thesis introduces and evaluates approaches that can cope with complex syntactical environments by using statistical information derived from a term's context and combining it with domain-specific heuristic knowledge to perform a classification. Furthermore, this thesis demonstrates how metadata generated by these approaches can be used as error heuristics to greatly improve the performance of workflow systems for digitizations of early documents. / Extracting metadata from historical documents is a time-consuming, complex and highly error-prone task that usually has to be carried out by human experts. It is nevertheless necessary in order to establish links between documents, to answer search queries about historical events correctly, or to build semantic links. To reduce the manual effort of this task, Named Entity Recognition methods are to be applied.
Classifying terms in historical manuscripts, however, poses a major challenge, since the domain exhibits high spelling variance, caused among other things by orthography that was fixed only by convention. This thesis presents methods that also work in complex syntactic environments by drawing on information from the context of the terms to be classified and combining it with domain-specific heuristics. It further evaluates how the metadata obtained in this way can add value in workflow systems for the digitization of historical manuscripts through heuristics for detecting production errors. Classification Information Retrieval Text Mining Workflow planning Data Mining ddc:000
362	Privacy aware social information retrieval and spam filtering using folksonomies / Suche und Spam Entdeckung anhand von Folksonomien unter Beachtung datenschutzrelevanter Aspekte Navarro Bullock, Beate January 2015 Social interactions as introduced by Web 2.0 applications during the last decade have changed the way the Internet is used. Today, it is part of our daily lives to maintain contacts through social networks, to comment on the latest developments in microblogging services or to save and share information snippets such as photos or bookmarks online. Social bookmarking systems are part of this development. Users can share links to interesting web pages by publishing bookmarks and providing descriptive keywords for them. The structure which evolves from the collection of annotated bookmarks is called a folksonomy. The sharing of interesting and relevant posts enables new ways of retrieving information from the Web. Users can search or browse the folksonomy looking at resources related to specific tags or users. Ranking methods known from search engines have been adjusted to facilitate retrieval in social bookmarking systems. Hence, social bookmarking systems have become an alternative or addendum to search engines. In order to better understand the commonalities and differences of social bookmarking systems and search engines, this thesis compares several aspects of the two systems' structure, usage behaviour and content. This includes the use of tags and query terms, the composition of the document collections and the rankings of bookmarks and search engine URLs. Searchers (recorded via session ids), their search terms and the clicked-on URLs can be extracted from a search engine query logfile. They form links similar to those found in folksonomies, where a user annotates a resource with tags.
We use this analogy to build a tripartite hypergraph from query logfiles (a logsonomy), and compare structural and semantic properties of log- and folksonomies. Overall, we have found similar behavioural, structural and semantic characteristics in both systems. Driven by this insight, we investigate whether folksonomy data can be of use in web information retrieval in a similar way to query log data: we construct training data from query logs and a folksonomy to build models for a learning-to-rank algorithm. First experiments show a positive correlation of ranking results generated from the ranking models of both systems. The research is based on various data collections from the social bookmarking systems BibSonomy and Delicious, Microsoft's search engine MSN (now Bing) and Google data. To maintain social bookmarking systems as a good source for information retrieval, providers need to fight spam. This thesis introduces and analyses different features derived from the specific characteristics of social bookmarking systems to be used in spam detection classification algorithms. Best results can be derived from a combination of profile, activity, semantic and location-based features. Based on the experiments, a spam detection framework which identifies and eliminates spam activities for the social bookmarking system BibSonomy has been developed. The storing and publication of user-related bookmarks and profile information raises questions about user data privacy. What kinds of personal information are collected and how do systems handle user-related items? In order to answer these questions, the thesis looks into the handling of data privacy in the social bookmarking system BibSonomy. Legal guidelines about how to deal with the private data collected and processed in social bookmarking systems are also presented. Experiments show that the consideration of user data privacy in the process of feature design can be a first step towards strengthening data privacy.
/ Social interaction, as introduced by Web 2.0 applications over the last decade, has changed the way we use the Internet. Today it is part of everyday life to maintain contacts in social networks, to comment on the latest developments in microblogging applications, or to store and share interesting information such as photos or web links digitally. Social bookmarking systems are part of this development. Users can share links to interesting web pages by annotating them with descriptive terms (tags) and publishing them. The structure that emerges from the collection of annotated bookmarks is called a folksonomy. Users can browse it and search for links with particular tags or from particular users. Ranking methods already implemented in search engines have been adapted to facilitate search in social bookmarking systems. These systems have thus developed into a serious alternative or complement to traditional search engines. To better understand the commonalities and differences in the structure, usage and content of social bookmarking systems and search engines, this thesis compares and discusses the use of tags and search terms, the composition of the document collections and the construction of the rankings. From the search engine users in a logfile, their queries and the ranking results they clicked, a tripartite structure similar to that of a folksonomy can be built. The frequency distributions and structural properties of this graph are compared with the structure of a folksonomy. Overall, similar user behaviour and similar structures can be derived from both approaches. Building on this insight, the final step of the investigation generates training and test data from search engine logfiles and folksonomies and trains a ranking algorithm.
Initial analyses show that the rankings generated from the implicit feedback of search engines and of folksonomies are positively correlated. The investigations are based on various data collections from the social bookmarking systems BibSonomy and Delicious and on data from the search engines MSN (now Bing) and Google. For social bookmarking systems to remain high-quality information systems, providers must fight the spam that arises in them. In this thesis, various features of legitimate and illegitimate users are derived from the particular characteristics of folksonomies and tested for their suitability for spam detection. The best results come from a combination of profile, activity, semantic and location-based features. Based on the experiments, a spam detection application is developed with which spam in the social bookmarking system BibSonomy is detected and eliminated. The storage and publication of user-related data raises the question of whether a user's personal data are still sufficiently protected in social bookmarking systems. What kinds of personal data are collected in these systems, and how do existing systems handle these data? To answer these questions, the application BibSonomy is analysed from technical and data protection law perspectives. Guidelines are developed that are intended to serve as a guide for handling personal data in the development and operation of social bookmarking systems. Experiments on spam classification show that taking data protection aspects into account when selecting classification features can protect personal data without significantly reducing the performance of the system. Information Retrieval Data Mining Social network ddc:000
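The logsonomy construction described in entry 362 (session id as user, query terms as tags, clicked URL as resource) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LogEntry` record, the whitespace tokenisation and the function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple


class LogEntry(NamedTuple):
    """One click record from a query log (hypothetical layout)."""
    session_id: str
    query: str
    clicked_url: str


Triple = Tuple[str, str, str]  # (user, tag, resource)


def build_logsonomy(entries: Iterable[LogEntry]) -> Set[Triple]:
    """Turn click records into folksonomy-style (user, tag, resource) triples.

    Assumption: each whitespace-separated query term plays the role of a tag.
    """
    triples: Set[Triple] = set()
    for entry in entries:
        for term in entry.query.lower().split():
            triples.add((entry.session_id, term, entry.clicked_url))
    return triples


def tag_frequencies(triples: Iterable[Triple]) -> Counter:
    """Count how many triples each tag takes part in."""
    return Counter(tag for _, tag, _ in triples)


log = [
    LogEntry("s1", "python tutorial", "http://a.example"),
    LogEntry("s2", "Python", "http://a.example"),
]
logsonomy = build_logsonomy(log)
```

Frequency distributions such as `tag_frequencies(logsonomy)` are one of the simpler properties that could then be compared with the same statistic computed on a folksonomy.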
363	Low level structures in the implementation of the relational algebra Otoo, Ekow J. January 1983 No description available. Database management. Data structures (Computer science)
364	Doctoral students’ mental models of a web search engine : an exploratory study Li, Ping, 1965- January 2007 No description available. Google College students -- Psychology. Search engines.
365	The construction of student pathways during information-seeking sessions using hypermedia programs : a social semiotic perspective Zammit, Katina, University of Western Sydney, College of Arts, School of Humanities and Languages January 2007 The thesis extends the use of systemic functional linguistics (SFL) to describe and analyse the semiotic systems beyond language by providing a detailed and systematic approach to the description of multimodal hypertext systems. The thesis uses a social semiotic approach to the text in order to develop an analytical framework for the description of hypertext through the two dimensions of rank and metafunction. This approach is employed to describe, assess and evaluate the pathways that student user groups construct using hypertext resources during a task-based information search session. The resources realised at the ranks of element, screen and pathway are described across four metafunctions: Representational, Interactive, Compositional and Logical. The data which forms the basis of the thesis was collected from a Year 4-5-6 classroom in a primary school in Sydney, Australia. The substantive contributions of this thesis detail the resources of the hypertext analytical framework. / Doctor of Philosophy (PhD) information literacy hypertext systems functionalism (Linguistics) information retrieval
366	A Novel Concept and Context-Based Approach for Web Information Retrieval Zakos, John, n/a January 2005 Web information retrieval is a relatively new research area that has attracted a significant amount of interest from researchers around the world since the emergence of the World Wide Web in the early 1990s. The problems facing successful web information retrieval are a combination of challenges that stem from traditional information retrieval and challenges characterised by the nature of the World Wide Web. The goal of any information retrieval system is to fulfil a user's information need. In a web setting, this means retrieving as many relevant web documents as possible in response to an input query that typically contains only a few terms expressing the user's information need. This thesis is primarily concerned with firstly reviewing pertinent literature related to various aspects of web information retrieval research and secondly proposing and investigating a novel concept and context-based approach. The approach consists of techniques that can be used together or independently and aim to provide an improvement in retrieval accuracy over other approaches. A novel concept-based term weighting technique is proposed as a new method of deriving query term significance from ontologies that can be used for the weighting of input queries. A technique that dynamically determines the significance of terms occurring in documents based on the matching of contexts is also proposed. Other contributions of this research include techniques for the combination of document and query term weights for the ranking of retrieved documents. All techniques were implemented and tested on benchmark data. This provides a basis for performing comparison with previous top performing web information retrieval systems. High retrieval accuracy is reported as a result of utilising the proposed approach.
This is supported through comprehensive experimental evidence and favourable comparisons against previously published results. Web information retrieval World Wide Web retrieval accuracy weighting technique
367	Visualisation in mining documents for retrieval using self organising maps Tan, Hiong Sen January 2005 This dissertation presents a study of creating maps that can be used to help people seek information from Internet documents. The study involves several different research areas in computer and information science, including web mining, data mining, artificial neural networks (in particular self-organising maps, SOM), information visualisation, user interfaces and information retrieval. The purpose of this dissertation is to offer an alternative way to retrieve information by visually representing the characteristics of the unseen documents and their relationships on the 2-dimensional surface of the SOM. The process starts with collecting documents that include text and images from the Internet, then moves to extracting important features from them. In other words, we are performing an information retrieval indexing process. The document features are then clustered by using the SOM. As a result, documents with similar features will be clustered together on 2-dimensional maps. The maps are labelled and the documents are connected to locations on the maps based on the labels. The maps are then arranged hierarchically and visualised so that they can be used as a browsing and exploration tool for information retrieval. Data mining Map reading Information management Information retrieval
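The clustering step in entry 367 (document feature vectors placed on a 2-dimensional self-organising map) can be illustrated with a very small pure-Python SOM. This is a generic textbook-style sketch, not the dissertation's implementation: the grid size, the linear decay schedule, the Gaussian neighbourhood and all function names are assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Return the grid cell whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(weights[cell], vector)),
    )


def train_som(data, rows, cols, epochs=100, start_rate=0.5, seed=0):
    """Train a rows x cols map on equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialised weight vector per grid cell.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    start_radius = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs              # shrinks linearly towards 0
        rate = start_rate * decay               # learning rate for this epoch
        radius = max(start_radius * decay, 0.5)  # neighbourhood width
        for vector in data:
            bmu = best_matching_unit(weights, vector)
            for cell, w in weights.items():
                grid_dist_sq = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                # Pull the cell's weights towards the input vector.
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


def place_documents(weights, doc_vectors):
    """Map each document name to the grid cell of its best matching unit."""
    return {name: best_matching_unit(weights, vec) for name, vec in doc_vectors.items()}


# Hypothetical three-dimensional feature vectors for three documents.
docs = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 0.0, 1.0]}
som = train_som(list(docs.values()), rows=3, cols=3)
placement = place_documents(som, docs)
```

Documents with similar feature vectors tend to land on the same or neighbouring cells, which is the property a map-based browsing tool would rely on.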
368	Efficient Query Expansion Billerbeck, Bodo, bodob@cs.rmit.edu.au January 2006 Hundreds of millions of users each day search the web and other repositories to meet their information needs. However, queries can fail to find documents due to a mismatch in terminology. Query expansion seeks to address this problem by automatically adding terms from highly ranked documents to the query. While query expansion has been shown to be effective at improving query performance, the gain in effectiveness comes at a cost: expansion is slow and resource-intensive. Current techniques for query expansion use fixed values for key parameters, determined by tuning on test collections. We show that these parameters may not be generally applicable, and, more significantly, that the assumption that the same parameter settings can be used for all queries is invalid. Using detailed experiments, we demonstrate that new methods for choosing parameters must be found. In conventional approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We demonstrate a new method of obtaining expansion terms, based on past user queries that are associated with documents in the collection. The most effective query expansion methods rely on costly retrieval and processing of feedback documents. We explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memory. This method allows query expansion to proceed three times faster than previously, while approximating the effectiveness of standard expansion. We investigate the use of document expansion, in which documents are augmented with related terms extracted from the corpus during indexing, as an alternative to query expansion. The overheads at query time are small.
We propose and explore a range of corpus-based document expansion techniques and compare them to corpus-based query expansion on TREC data. These experiments show that document expansion delivers at best limited benefits, while query expansion, including standard techniques and efficient approaches described in recent work, usually delivers good gains. We conclude that document expansion is unpromising, but it is likely that the efficiency of query expansion can be further improved. information retrieval query expansion pseudo relevance feedback efficiency
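The conventional form of query expansion mentioned in entry 368 (an initial retrieval run, then adding terms from the highly ranked documents) might look roughly like the sketch below. It is a deliberately naive illustration: the term-count scoring, the frequency-based term selection, the fixed parameters `top_docs` and `extra_terms`, and all function names are assumptions, and it does not reproduce the more efficient methods the thesis proposes.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenisation (an assumption for this sketch)."""
    return text.lower().split()


def rank(query_terms, docs):
    """Initial retrieval run: order documents by query term occurrences."""
    scored = []
    for index, doc in enumerate(docs):
        tokens = tokenize(doc)
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties broken by document position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored]


def expand_query(query, docs, top_docs=2, extra_terms=2):
    """Add the most frequent new terms from the top-ranked documents."""
    query_terms = tokenize(query)
    feedback = rank(query_terms, docs)[:top_docs]
    counts = Counter()
    for index in feedback:
        counts.update(t for t in tokenize(docs[index]) if t not in query_terms)
    # Most frequent first; ties broken alphabetically for determinism.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:extra_terms]
    return query_terms + [term for term, _ in best]


collection = ["car engine repair", "car engine oil", "banana bread recipe"]
expanded = expand_query("car", collection)
```

The fixed values of `top_docs` and `extra_terms` are exactly the kind of tuned parameters that the abstract argues should not be assumed to suit every query.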
370	Searching and ranking structured documents Trotman, Andrew, n/a January 2007 It is common to see documents with explicit structure marked up in languages such as XML. Queries, on the other hand, typically have no structure. There is a clear mismatch: although documents contain structure, it is typically not used in information retrieval. An efficient index structure for document-centric searching is proposed and its efficiency is discussed. It is shown to be at worst linear with respect to the number of occurrences of a given search term. The algorithm is then extended to accommodate element-centric information retrieval. Ranking algorithms for structured documents are examined. Genetic Algorithms are used to learn different weights for each structure present in a document. Applying these weights as part of a function is shown to yield significant precision improvements in some functions. Genetic Programming is then used to learn an entire ranking function. This function is shown to be portable between document collections. A query language for structured information retrieval is proposed. Use of this language in the 2004 INEX workshop resulted in a large decrease in query errors. Structured information retrieval is now a viable alternative to its unstructured counterpart. A successful query language, efficient indexing structures, and improved ranking functions are all presented. information retrieval query languages (computer science) algorithms computer programs
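The idea in entry 370 of applying a separate weight to each document structure inside a ranking function can be shown with a toy scorer. In this sketch the weights are simply given by hand; the thesis learns them with Genetic Algorithms, which is not shown here. The field names, the term-frequency scoring and the function name are assumptions made for illustration.

```python
def structure_weighted_score(query_terms, doc_fields, weights):
    """Sum per-structure term frequencies, each scaled by that structure's weight.

    `doc_fields` maps a structure name (e.g. an XML element) to its text;
    structures without an explicit weight default to 1.0.
    """
    score = 0.0
    for field, text in doc_fields.items():
        tokens = text.lower().split()
        term_frequency = sum(tokens.count(term) for term in query_terms)
        score += weights.get(field, 1.0) * term_frequency
    return score


# Hypothetical document with two structures and hand-picked weights.
document = {
    "title": "xml retrieval",
    "body": "retrieval of xml documents and xml elements",
}
field_weights = {"title": 3.0, "body": 1.0}
score = structure_weighted_score(["xml"], document, field_weights)
```

Here a match in the title counts three times as much as a match in the body; a learning procedure would search for the weight values that maximise precision on training queries.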
