401 |
Improving library searches using word-correlation factors and folksonomies / Pera, Maria Soledad, January 2009 (has links) (PDF)
Thesis (M.S.)--Brigham Young University. Dept. of Computer Science, 2009. / Includes bibliographical references (p. 76-81).
|
402 |
Query rewriting for extracting data behind HTML forms / Chen, Xueqi, January 2004 (has links)
Thesis (M.S.)--Brigham Young University. Dept. of Computer Science, 2004. / Includes bibliographical references (p. 47-50).
|
403 |
Secure object spaces for global information retrieval (SOSGIR) / Cheung, Yee-him. January 2000 (has links)
Thesis (M. Phil.)--University of Hong Kong, 2001. / Includes bibliographical references (leaves 90-91).
|
404 |
WebDoc: an automated Web document indexing system / Tang, Bo. January 2002 (has links)
Thesis (M.S.)--Mississippi State University. Department of Computer Science. / Title from title screen. Includes bibliographical references.
|
405 |
Schema matching and data extraction over HTML tables / Tao, Cui, January 2003 (has links) (PDF)
Thesis (M.S.)--Brigham Young University. Dept. of Computer Science, 2003. / Includes bibliographical references (p. 51-56).
|
406 |
Supervised language models for temporal resolution of text in absence of explicit temporal cuesKumar / Kumar, Abhimanu 18 March 2014 (has links)
This thesis explores the temporal analysis of text using the implicit temporal cues present in a document. We consider the case where all explicit temporal expressions, such as specific dates or years, are removed from the text and a bag-of-words approach is used to predict the text's timestamp. A set of gold-standard text documents with timestamps is used as the training set. We also predict time spans for Wikipedia biographies based on their text. Our training texts range from 3800 BC to the present day. We partition this timeline into equal-sized chronons and build a probability histogram for a test document over this chronon sequence. The document is assigned to the chronon with the highest probability.
We use two approaches: 1) a generative language model with Bayesian priors, and 2) a KL-divergence-based model. To counter sparsity in the documents and chronons, we use three different smoothing techniques across the models. We test our models on three diverse datasets: 1) Wikipedia Biographies, 2) Gutenberg Short Stories, and 3) the Wikipedia Years dataset.
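A minimal sketch of the KL-divergence approach (our own Python illustration, not the thesis code): fit one smoothed unigram language model per chronon and assign a test document to the chronon that minimizes KL divergence. Laplace smoothing and the parameter mu here are illustrative stand-ins for the thesis's three smoothing techniques.

```python
import math
from collections import Counter

def build_chronon_models(docs_by_chronon, vocab, mu=1.0):
    """One smoothed unigram model per chronon; each doc is a list of tokens."""
    models = {}
    for chronon, docs in docs_by_chronon.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        # Laplace smoothing: a stand-in for the thesis's smoothing techniques.
        models[chronon] = {w: (counts[w] + mu) / (total + mu * len(vocab))
                           for w in vocab}
    return models

def predict_chronon(doc, models):
    """Return the chronon whose model minimizes KL(doc || chronon model)."""
    doc_counts = Counter(doc)
    n = sum(doc_counts.values())
    best, best_kl = None, float("inf")
    for chronon, model in models.items():
        # Out-of-vocabulary words are skipped in this toy version.
        kl = sum((c / n) * math.log((c / n) / model[w])
                 for w, c in doc_counts.items() if w in model)
        if kl < best_kl:
            best, best_kl = chronon, kl
    return best
```

The same loop structure would support the generative variant: replace the KL score with the document's log-likelihood under each chronon model plus a log-prior.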
Our models are trained on a subset of Wikipedia biographies. We concentrate on two prediction tasks: 1) timestamp prediction for a generic text, or mid-span prediction for a Wikipedia biography, and 2) life-span prediction for a Wikipedia biography. We achieve an f-score of 81.1% for the life-span prediction task and a mean error of around 36 years for mid-span prediction for biographies ranging from the present day back to 3800 BC. The best model gives a mean error of 18 years for publication-date prediction for short stories uniformly distributed over the range 1700 AD to 2010 AD. Our models exploit the temporal distribution of text to associate time with it. Our error analysis reveals interesting properties of the models and datasets used.
We also combine explicit temporal cues extracted from a document with its implicit cues to obtain a combined prediction model. We show that a combination of date-based predictions and language-model divergence predictions is highly effective for this task: our best model obtains an f-score of 81.1%, and the median error between actual and predicted life-span midpoints is 6 years. This will be one emphasis of our future work.
The above analyses demonstrate that texts contain strong temporal cues that can be exploited statistically for temporal prediction. Along the way, we also create good benchmark datasets for the research community to use in exploring this problem further. / text
|
407 |
Similarity search with earth mover's distance at scale / Tang, Yu, 唐宇 January 2013 (has links)
Earth Mover's Distance (EMD), as a similarity measure, has received a lot of attention in the fields of multimedia and probabilistic databases, computer vision, image retrieval, machine learning, etc. EMD on multidimensional histograms provides better distinguishability between the objects approximated by the histograms (e.g., images), compared to classic measures like Euclidean distance. Despite its usefulness, EMD has a high computational cost; therefore, a number of effective filtering methods have been proposed, to reduce the pairs of histograms for which the exact EMD has to be computed, during similarity search. Still, EMD calculations in the refinement step remain the bottleneck of the whole similarity search process. In this thesis, we focus on optimizing the refinement phase of EMD-based similarity search by (i) adapting an efficient min-cost flow algorithm (SIA) for the EMD computation, (ii) proposing a dynamic distance bound, which is progressively updated and tightened during the refinement process and can be used to terminate an EMD refinement early, and (iii) proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computations. Our proposed techniques are orthogonal to and can be easily integrated with the state-of-the-art filtering techniques, reducing the cost of EMD-based similarity queries by orders of magnitude. / published_or_final_version / Computer Science / Master / Master of Philosophy
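A schematic sketch of the filter-and-refine idea for 1-D histograms (our own illustration, not the thesis's SIA-based implementation): a cheap lower bound on EMD prunes candidates, and the running k-th best distance tightens the filter as refinement proceeds. SciPy's exact 1-D EMD serves as the refinement step; the centroid bound assumes histograms normalized to equal mass.

```python
import heapq
import numpy as np
from scipy.stats import wasserstein_distance  # exact EMD for 1-D histograms

def centroid_lower_bound(u, v, positions):
    # For equal-mass 1-D histograms, |centroid difference| <= EMD.
    return abs(np.dot(u, positions) - np.dot(v, positions))

def knn_emd(query, database, positions, k=5):
    heap = []  # max-heap via negated distances: the current k best matches
    for idx, hist in enumerate(database):
        bound = centroid_lower_bound(query, hist, positions)
        if len(heap) == k and bound >= -heap[0][0]:
            continue  # filtered: bound already exceeds the k-th best distance
        d = wasserstein_distance(positions, positions, query, hist)  # refine
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg, idx) for neg, idx in heap)
```

The thesis's contributions replace this static bound with one that tightens dynamically during each EMD computation and reorders candidates on the fly.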
|
408 |
Νέες τεχνικές αξιολόγησης ανάκτησης πληροφορίας / New techniques in evaluating information retrieval / Ευαγγελόπουλος, Ξενοφών 27 May 2015 (has links)
Information retrieval is an important area of computer science that aims to extract unstructured information (usually text) from large document collections according to a user's information need. In recent years, a major strand of information retrieval research has focused on evaluating the retrieval process itself; as a result, numerous evaluation metrics and user models have been developed that try to evaluate and model a user's behaviour during search as faithfully as possible.
In this thesis we propose a new evaluation metric for information retrieval that aims to evaluate the search process from the perspective of the user's behaviour. A conventional approach to determining the relevance of a document is to use judgements from assessors trained to decide whether a document is relevant to a given query. However, these judgements do not always reflect the opinions of all users, only those of a small proportion. Our metric introduces a novel notion of relevance, the "popularity" of a document/web page, which can be seen as each user's vote for that page. By employing a linear combination of assessors' relevance judgements and users' popularity votes, we arrive at a metric that better explains user behaviour.
Additionally, we present a novel click model that simulates user search behaviour and aims to estimate the relevance of a document from the interaction data a user leaves behind while searching. The model is based on the theory of dynamic Bayesian networks and employs the notion of popularity to better estimate a document's actual relevance, rather than its perceived relevance, as most other models estimate.
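A toy sketch of the proposed metric (the function names, the value of alpha, and the [0, 1] scaling of both signals are our assumptions, not the thesis's): blend assessor judgements with popularity into a single gain and plug it into a DCG-style ranking score.

```python
import math

def blended_gain(judgement, popularity, alpha=0.7):
    # Linear combination of assessor judgement and users' popularity votes.
    return alpha * judgement + (1 - alpha) * popularity

def blended_dcg(ranking, alpha=0.7):
    # ranking: (judgement, popularity) pairs in rank order, top result first.
    return sum(blended_gain(j, p, alpha) / math.log2(rank + 2)
               for rank, (j, p) in enumerate(ranking))

# A popular page that assessors rated low still contributes some gain.
print(blended_dcg([(1.0, 0.9), (0.0, 0.8), (2.0, 0.1)]))
```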
|
409 |
Automatic identification of causal relations in text and their use for improving precision in information retrieval / Khoo, Christopher S. G. 12 1900 (has links)
Parts of the thesis were published in:
1. Khoo, C., Myaeng, S.H., & Oddy, R. (2001). Using cause-effect relations in text to improve information retrieval precision. Information Processing and Management, 37(1), 119-145.
2. Khoo, C., Kornfilt, J., Oddy, R., & Myaeng, S.H. (1998). Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing. Literary & Linguistic Computing, 13(4), 177-186.
3. Khoo, C. (1997). The use of relation matching in information retrieval. LIBRES: Library and Information Science Research Electronic Journal [Online], 7(2). Available at: http://aztec.lib.utk.edu/libres/libre7n2/.
An update of the literature review on causal relations in text was published in: Khoo, C., Chan, S., & Niu, Y. (2002). The many facets of the cause-effect relation. In R. Green, C.A. Bean & S.H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective (pp. 51-70). Dordrecht: Kluwer. / This study represents one attempt to make use of relations expressed in text to improve information retrieval effectiveness. In particular, the study investigated whether the information obtained by matching causal relations expressed in documents with the causal relations expressed in users' queries could be used to improve document retrieval results, in comparison to using just term matching without considering relations.
An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. The method uses linguistic clues to identify causal relations without recourse to knowledge-based inferencing. The method was successful in identifying and extracting about 68% of the causal relations that were clearly expressed within a sentence or between adjacent sentences in Wall Street Journal text. Of the instances that the computer program identified as causal relations, 72% can be considered to be correct.
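A toy illustration of clue-based extraction in Python (our own patterns, far cruder than the system that achieved the figures above): surface cues alone flag a cause-effect pair, with no knowledge-based inferencing.

```python
import re

# A few illustrative cue patterns; the actual system uses a much richer
# inventory of linguistic clues.
CAUSAL_PATTERNS = [
    re.compile(r"(?P<cause>[^.]+?)\s+(?:causes?|caused|leads? to|led to|"
               r"results? in|resulted in)\s+(?P<effect>[^.]+)", re.I),
    re.compile(r"(?P<effect>[^.]+?)\s+(?:because of|due to|as a result of)"
               r"\s+(?P<cause>[^.]+)", re.I),
]

def extract_causal(sentence):
    """Return a (cause, effect) pair if a cue pattern matches, else None."""
    for pattern in CAUSAL_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group("cause").strip(), match.group("effect").strip()
    return None

print(extract_causal("Higher interest rates led to a drop in housing starts."))
# -> ('Higher interest rates', 'a drop in housing starts')
```

Relation matching against a query then compares such extracted pairs, optionally leaving the cause or effect slot as a wildcard, as described below.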
The automatic method was used in an experimental information retrieval system to identify causal relations in a database of full-text Wall Street Journal documents. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query, as in an SDI or routing-query situation. The best results were obtained when causal relation matching was combined with word proximity matching (matching pairs of causally related words in the query with pairs of words that co-occur within document sentences). An analysis using manually identified causal relations indicates that bigger retrieval improvements can be expected with more accurate identification of causal relations. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match any term.
The study also investigated whether using Roget's International Thesaurus (3rd ed.) to expand query terms with synonymous and related terms would improve retrieval effectiveness. Using Roget category codes in addition to keywords did give better retrieval results. However, the Roget codes were better at identifying the non-relevant documents than the relevant ones.
|
410 |
Discriminating Meta-Search: A Framework for Evaluation / Chignell, Mark, Gwizdka, Jacek, Bodner, Richard January 1999
DOI: 10.1016/S0306-4573(98)00065-X / There was a proliferation of electronic information sources and search engines in the 1990s. Many of these information sources became available through the ubiquitous interface of the Web browser. Diverse information sources became accessible to information professionals and casual end users alike. Much of the information was also hyperlinked, so that information could be explored by browsing as well as searching. While vast amounts of information were now just a few keystrokes and mouse clicks away, as the choices multiplied, so did the complexity of choosing where and how to look for the electronic information. Much of the complexity in information exploration at the turn of the twenty-first century arose because there was no common cataloguing and control system across the various electronic information sources. In addition, the many search engines available differed widely in terms of their domain coverage, query methods, and efficiency.
Meta-search engines were developed to improve search performance by querying multiple search engines at once. In principle, meta-search engines could greatly simplify the search for electronic information by selecting the subset of first-level search engines and digital libraries to which a query is submitted, based on the characteristics of the user, the query/topic, and the search strategy. This selection would be guided by diagnostic knowledge about which of the first-level search engines works best under what circumstances. Programmatic research is required to develop this diagnostic knowledge about first-level search engine performance.
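A schematic sketch of such engine selection (all names, features, and the threshold are our inventions; the paper proposes the evaluative framework for acquiring this diagnostic knowledge, not this code):

```python
def select_engines(query_features, diagnostics, threshold=0.5):
    # diagnostics: engine name -> predictor of effectiveness for these features.
    return [name for name, predict in diagnostics.items()
            if predict(query_features) >= threshold]

def meta_search(query, query_features, engines, diagnostics, k=10):
    # Query only the engines predicted to do well, then merge naively;
    # a real meta-search engine would normalize scores before merging.
    merged = []
    for name in select_engines(query_features, diagnostics):
        merged.extend((name, hit) for hit in engines[name](query))
    return merged[:k]

# Toy usage with stub engines and hand-written diagnostic rules.
engines = {"A": lambda q: [q + "-a1", q + "-a2"], "B": lambda q: [q + "-b1"]}
diagnostics = {"A": lambda f: 0.9 if f.get("domain") == "news" else 0.2,
               "B": lambda f: 0.6}
print(meta_search("ir", {"domain": "news"}, engines, diagnostics))
```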
This paper introduces an evaluative framework for this type of research and illustrates its use in two experiments. The experimental results obtained are used to characterize some properties of leading search engines (as of 1998). Significant interactions were observed between search engine and two other factors (time of day, and Web domain). These findings supplement those of earlier studies, providing preliminary information about the complex relationship between search engine functionality and performance in different contexts. While the specific results obtained represent a time-dependent snapshot of search engine performance in 1998, the evaluative framework proposed should be generally applicable in the future.
|