371 |
Searching by browsing / Cox, Kevin Ross, January 1994
Information retrieval (IR) is an important part of many tasks performed by people when they
use computers. However, most IR research and theory isolates the IR component from the
tasks performed by users. This is done by expressing user needs as a query performed on a
database. In contrast, this dissertation investigates the design and evaluation of information
retrieval systems where the information retrieval mechanisms remain embedded in the user
tasks.
While there are many different types of user tasks performed with computers, we can specify
common requirements for the IR needed in most tasks. There are both user interface and
machine processing requirements. For user interfaces it is desirable that users interact directly
with information databases, keep control of the interaction, and are able to perform IR in a
timely manner. Machine processing has to be within the capabilities of machines, yet must fit
with human perceptions and be efficient in both storage and computation.
Given the overall requirements, the dissertation presents a particular implementation showing how to
embed IR in tasks. The implementation uses a vector representation for objects and organises
the objects in a near neighbour data structure. Near neighbours are defined within the context
of the tasks the users wish to achieve. While the implementation could use many different
finding mechanisms, it emphasises a constructive solution-building approach with localised
browsing in the database. It is shown how the IR implementation fits with the overall task
activities of the user.
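As a rough illustration, the browsing step might be sketched as follows (Python; the random vectors and Euclidean distance are stand-in assumptions, since the dissertation defines near neighbours within the context of the user's task):

    import numpy as np

    def nearest_neighbours(vectors, current, k=5):
        """Return the indices of the k objects closest to the current one.

        Brute force for clarity; a real system would precompute a
        near-neighbour data structure rather than scan on every step.
        """
        dists = np.linalg.norm(vectors - vectors[current], axis=1)
        dists[current] = np.inf            # exclude the object itself
        return np.argsort(dists)[:k]

    # Browsing loop: the user repeatedly inspects the neighbourhood of the
    # current object and moves on, so retrieval stays embedded in the task.
    vectors = np.random.rand(1000, 32)     # hypothetical object vectors
    current = 0
    for _ in range(3):
        neighbours = nearest_neighbours(vectors, current)
        current = int(neighbours[0])       # stand-in for the user's choice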
Much of the dissertation examines how to evaluate embedded IR. Embedded IR requires
testing users' task performance in both real experiments and thought experiments.
The implementation is tested by finding known objects, by validating the machine representations
and their correspondence with human perceptions, and by testing the machine performance of
the implementation.
Finally, implications and extensions of the work are explored by looking at the practicality of
the approach, other methods of investigation, and the possibility of building dynamic learning
systems that improve with use.
|
372 |
Search Engine Optimisation Using Past Queries / Garcia, Steven, steven.garcia@student.rmit.edu.au, January 2008
World Wide Web search engines process millions of queries per day from users all over the world. Efficient query evaluation is achieved through the use of an inverted index, where, for each word in the collection, the index maintains a list of the documents in which the word occurs. Query processing may also require access to document-specific statistics, such as document length; access to word statistics, such as the number of unique documents in which a word occurs; and collection-specific statistics, such as the number of documents in the collection. The index maintains individual data structures for each of these sources of information, and repeatedly accesses each to process a query. A by-product of a web search engine is a list of all queries entered into the engine: a query log. Analyses of query logs have shown repetition of query terms in the requests made to the search system. In this work we explore techniques that take advantage of the repetition of user queries to improve the accuracy or efficiency of text search. We introduce an index organisation scheme that favours those documents that are most frequently requested by users and show that, in combination with early-termination heuristics, query processing time can be dramatically reduced without reducing the accuracy of the search results. We examine the stability of such an ordering and show that an index based on as little as 100,000 training queries can support at least 20 million requests. We show the correlation between frequently accessed documents and relevance, and attempt to exploit the demonstrated relationship to improve search effectiveness. Finally, we deconstruct the search process to show that query-time redundancy can be exploited at various levels of the search process. We develop a model that illustrates the improvements that can be achieved in query processing time by caching different components of a search system. This model is then validated by simulation using a document collection and query log. Results on our test data show that a well-designed cache can reduce disk activity by more than 30%, with a cache that is one tenth the size of the collection.
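A minimal sketch of the access-ordered index with early termination (Python; the postings lists, access counts, and unit term weights are hypothetical stand-ins for the thesis's log-trained ordering and ranking function):

    from collections import defaultdict

    # Hypothetical inputs: postings lists per term, and per-document access
    # counts learned from a training query log.
    postings = {"search": [3, 7, 9, 12], "engine": [3, 9, 21]}
    access_count = defaultdict(int, {3: 120, 9: 45, 7: 2, 12: 1, 21: 30})

    # Reorder each postings list so frequently requested documents come first.
    for term in postings:
        postings[term].sort(key=lambda d: -access_count[d])

    def evaluate(query_terms, max_postings=2):
        """Score documents, stopping early in each list (early termination)."""
        scores = defaultdict(float)
        for term in query_terms:
            for doc in postings.get(term, [])[:max_postings]:
                scores[doc] += 1.0       # stand-in for a real term weight
        return sorted(scores.items(), key=lambda hit: -hit[1])

    print(evaluate(["search", "engine"]))   # docs 3 and 9 score highest

Because the heavily requested documents sit at the front of every list, truncating each scan loses little accuracy while skipping most of the postings.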
|
373 |
Federated Text Retrieval from Independent Collections / Shokouhi, Milad, milads@microsoft.com, January 2008
Federated information retrieval is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot index uncrawlable hidden web collections; federated information retrieval systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents is selected. This creates the collection selection problem. To be able to select suitable collections, federated information retrieval systems acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. In this thesis, we propose new approaches for each of these problems. Our suggested methods, for collection representation, collection selection, and result merging, outperform state-of-the-art techniques in most cases. We also propose novel methods for estimating the number of documents in collections, and for pruning unnecessary information from collection representation sets. Although management of document duplication has been cited as one of the major problems in federated search, prior research in this area often assumes that collections are free of overlap. We investigate the effectiveness of federated search on overlapped collections, and propose new methods for maximizing the number of distinct relevant documents in the final merged results. In summary, this thesis introduces several new contributions to the field of federated information retrieval, including practical solutions to some historically unsolved problems in federated search, such as document duplication management. We test our techniques on multiple testbeds that simulate both hidden web and enterprise search environments.
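The three-step pipeline can be pictured with a small sketch (Python; the collection representations, scoring rule, and result lists are illustrative assumptions, not the thesis's actual methods):

    # Hypothetical collection representations: collection sizes and per-term
    # document frequencies, e.g. acquired by query-based sampling.
    collections = {
        "hidden_web": {"size": 50000, "df": {"malaria": 900, "vaccine": 400}},
        "enterprise": {"size": 10000, "df": {"malaria": 5, "vaccine": 80}},
        "news":       {"size": 80000, "df": {"malaria": 300, "vaccine": 900}},
    }

    def select_collections(query_terms, k=2):
        """Rank collections by how common the query terms are in each;
        a crude stand-in for CORI-style collection selection."""
        def score(rep):
            return sum(rep["df"].get(t, 0) / rep["size"] for t in query_terms)
        return sorted(collections, key=lambda c: -score(collections[c]))[:k]

    def merge_results(result_lists):
        """Interleave per-collection results by score; real merging must
        first normalise scores that are not directly comparable."""
        merged = [hit for hits in result_lists.values() for hit in hits]
        return sorted(merged, key=lambda hit: -hit[1])

    chosen = select_collections(["malaria", "vaccine"])
    results = {c: [(f"{c}-doc1", 0.9), (f"{c}-doc2", 0.4)] for c in chosen}
    print(merge_results(results))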
|
374 |
Improved Cross-language Information Retrieval via Disambiguation and Vocabulary Discovery / Zhang, Ying, ying.yzhang@gmail.com, January 2007
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from segmentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval, but also have wider applications than CLIR.
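The flavour of corpus-based translation disambiguation can be conveyed with a toy sketch (Python; the dictionary entries, corpus, and co-occurrence scoring are illustrative assumptions, and a crude simplification of the thesis's Markov-model approach over web documents):

    from itertools import product

    # Hypothetical dictionary entries: candidate English translations for
    # each Chinese query term.
    candidates = {
        "银行": ["bank", "shore"],
        "利率": ["interest rate"],
    }

    def cooccurrence(a, b, corpus):
        """Count corpus documents (e.g. web snippets) mentioning both terms."""
        return sum(1 for doc in corpus if a in doc and b in doc)

    def disambiguate(candidates, corpus):
        """Choose the combination of translations whose terms co-occur most,
        so each term's context resolves the others' ambiguity."""
        combos = product(*candidates.values())
        return max(combos, key=lambda combo: sum(
            cooccurrence(a, b, corpus) for a in combo for b in combo if a != b))

    corpus = ["the bank raised its interest rate", "waves broke on the shore"]
    print(disambiguate(candidates, corpus))   # ('bank', 'interest rate')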
|
375 |
An architecture for a function driven mechanical design solution library / Wood, Stephen L. (Stephen Lathrop), 01 September 1994
The engineering design process and the advancement of future computer-aided design systems need new design aids for use during the conceptual design phase. It is in this phase that information is gathered and an understanding of the problem is developed, analyzed, and broken into smaller, more manageable elements. These elements consist of customer requirements and engineering specifications, many of which are converted into functional expressions that need to be satisfied. Of these elements, it is at the most basic level of the functional expression that the beginning form of a product is developed. Upon that initial form, consisting of the basic envelope (area domain) of the product and defined by form features, components and assemblies are added to fulfill the functional requirements of the product.
This dissertation develops the architecture of a Function Driven Mechanical Design Solution Library for the most primitive design structure: the feature. Each feature has functional expressions associated with it, which represent the fundamental information about the structure. The implementation uses the functionality a feature inherently possesses to obtain solutions. By using a feature's functionality as the search criteria during the design of mechanical components, the design engineer has access to a wider variety of design solutions than traditional libraries are capable of finding, and gains a more in-depth understanding of the design. / Graduation date: 1995
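One might picture such a function-indexed library along these lines (Python; the example features, functional expressions, and set-based matching are all hypothetical, not the dissertation's actual representation):

    from dataclasses import dataclass, field

    @dataclass
    class Feature:
        """A primitive design structure tagged with the functions it fulfils."""
        name: str
        functions: set = field(default_factory=set)

    # Hypothetical library entries; functional expressions are simplified
    # to verb-object strings.
    library = [
        Feature("hole", {"locate pin", "reduce weight", "pass shaft"}),
        Feature("rib", {"increase stiffness", "reduce weight"}),
        Feature("boss", {"locate pin", "mount fastener"}),
    ]

    def find_by_function(required):
        """Return library features whose functionality meets a requirement."""
        return [f for f in library if f.functions & required]

    print([f.name for f in find_by_function({"locate pin"})])  # hole, boss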
|
376 |
A Web-based Question Answering System / Zhang, Dell; Lee, Wee Sun, 01 1900
The Web is apparently an ideal source of answers to a large variety of questions, due to the tremendous amount of information available online. This paper describes a Web-based question answering system LAMP, which is publicly accessible. A particular characteristic of this system is that it only takes advantage of the snippets in the search results returned by a search engine like Google. We think such a “snippet-tolerant” property is important for an online question answering system to be practical, because it is time-consuming to download and analyze the original web documents. The performance of LAMP is comparable to the best state-of-the-art question answering systems. / Singapore-MIT Alliance (SMA)
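The snippet-only idea can be sketched as follows (Python; the snippets, the capitalised-phrase extraction pattern, and frequency voting are toy assumptions, not LAMP's actual pipeline):

    import re
    from collections import Counter

    def answer_from_snippets(question, snippets):
        """Rank capitalised phrases in the snippets as answer candidates.

        No full web pages are downloaded: only the short snippets returned
        by the search engine are inspected.
        """
        stop = set(question.split())
        candidates = Counter()
        for snippet in snippets:
            for phrase in re.findall(r"(?:[A-Z][a-z]+ ?)+", snippet):
                phrase = phrase.strip()
                if phrase and phrase not in stop:
                    candidates[phrase] += 1
        return candidates.most_common(3)

    snippets = [                       # hypothetical search-engine snippets
        "Canberra is the capital of Australia.",
        "The capital, Canberra, was founded in 1913.",
    ]
    print(answer_from_snippets("What is the capital of Australia", snippets))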
|
377 |
TuliP a teacher's tool for lesson planning /Reed, R. Gabrielle. Hawkes, Lois Wright. January 2002 (has links)
Thesis (M.S.)--Florida State University, 2003. / Advisor: Dr. Lois Wright Hawkes, Florida State University, College of Arts and Sciences, Dept. of Computer Science. Title and description from dissertation home page (viewed Sept. 25, 2003). Includes bibliographical references.
|
378 |
Mining User-generated Content for Insights / Angel, Albert-David, 20 August 2012
The proliferation of social media, such as blogs, micro-blogs and social networks, has led to a plethora of readily available user-generated content. The latter offers a unique, uncensored window into emerging stories and events, ranging from politics and revolutions to product perception and the zeitgeist.
Importantly, structured information is available for user-generated content, by dint of its metadata, or can be surfaced via recently commoditized information extraction tools. This wealth of information, in the form of real-world entities and facts mentioned in a document, author demographics, and so on, provides exciting opportunities for mining insights from this content.
Capitalizing upon these, we develop Grapevine, an online system that distills information from the social media collective on a daily basis, and facilitates its interactive exploration. To further this goal, we address important research problems, which are also of independent interest. The sheer scale of the data being processed necessitates that our solutions be highly efficient.
We propose efficient techniques for mining important stories, on a per-user-demographic basis, based on named entity co-occurrences in user-generated content. Building upon these, we propose efficient techniques for identifying emerging stories as-they-happen, by identifying dense structures in an evolving entity graph.
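A minimal sketch of the entity co-occurrence idea (Python; the documents, entities, and edge-weight threshold are illustrative assumptions, and real emerging-story detection works over a graph that evolves in time):

    from collections import Counter
    from itertools import combinations

    # Hypothetical documents, each reduced to the entities it mentions.
    docs = [
        {"Obama", "Romney", "Ohio"},
        {"Obama", "Romney"},
        {"Obama", "Ohio"},
        {"Apple", "Samsung"},
    ]

    # Weight each entity pair by how many documents mention both.
    edges = Counter()
    for entities in docs:
        for pair in combinations(sorted(entities), 2):
            edges[pair] += 1

    def emerging_story(edges, min_weight=2):
        """Return entities joined by heavy co-occurrence edges, a crude
        stand-in for finding dense structures in the entity graph."""
        story = set()
        for (a, b), weight in edges.items():
            if weight >= min_weight:
                story.update((a, b))
        return story

    print(emerging_story(edges))   # {'Obama', 'Romney', 'Ohio'}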
To facilitate the exploration of these stories, we propose efficient techniques for filtering them, based on users’ textual descriptions of the entities involved.
These gathered insights need to be presented to users in a useful manner, via a diverse set of representative documents; we thus propose efficient techniques for addressing this problem.
Recommending related stories to users is important for navigation purposes. As the way in which these are related to the story being explored is not always clear, we propose efficient techniques for generating recommendation explanations via entity relatedness queries.
|
380 |
An n-gram Based Approach to the Automatic Classification of Web Pages by Genre / Mason, Jane E., 10 December 2009
The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata.
The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis was to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real world environment. This thesis research develops n-gram representations for Web pages and Web page genres, and based on these representations, a new approach to the classification of Web pages by genre is developed.
The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied. Techniques for setting genre thresholds in order to allow a Web page to belong to more than one genre, or to no genre at all, are also investigated, and a comparison of the classification performance of the new classification model with that of the popular support vector machine approach is made. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages.
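The byte n-gram representation can be sketched as follows (Python; the n-gram length, profile size, and the Cavnar-Trenkle-style out-of-place distance are illustrative choices, since the thesis treats exactly these parameters as open questions):

    from collections import Counter

    def ngram_profile(data, n=4, top=200):
        """Rank the most frequent byte n-grams of a page or genre sample."""
        counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
        return [gram for gram, _ in counts.most_common(top)]

    def out_of_place(profile_a, profile_b):
        """Dissimilarity as the sum of rank displacements between profiles."""
        rank_b = {gram: r for r, gram in enumerate(profile_b)}
        penalty = len(profile_b)          # cost of a gram missing from b
        return sum(abs(r - rank_b.get(gram, penalty))
                   for r, gram in enumerate(profile_a))

    faq_genre = ngram_profile(b"<html><body>FAQ  Q: ... A: ...</body></html>")
    new_page = ngram_profile(b"<html><body>Q: how? A: like so.</body></html>")
    print(out_of_place(new_page, faq_genre))  # smaller means more FAQ-like

Because the profiles are built over raw bytes, no tokenisation or language-specific preprocessing is required, and a page can be compared against several genre profiles with per-genre thresholds to support multi-label (or zero-label) classification.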
|