  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

The smoothed Dirichlet distribution: Understanding cross-entropy ranking in information retrieval

Nallapati, Ramesh 01 January 2006 (has links)
Unigram language modeling is a successful probabilistic framework for Information Retrieval (IR) that uses the multinomial distribution to model documents and queries. An important feature of this approach is its use of the empirically successful cross-entropy function between the query model and document models as a document ranking function. However, this function does not follow directly from the underlying models, and no justification for its use has been available to date. A related and interesting observation is that the naïve Bayes model for text classification uses the same multinomial distribution to model documents but, in contrast, employs the document log-likelihood, which follows directly from the model, as its scoring function. Curiously, the document log-likelihood closely corresponds to cross entropy, but to an asymmetric counterpart of the function used in language modeling. It has been empirically demonstrated that the version of cross entropy used in IR outperforms the document log-likelihood, but this interesting phenomenon remains largely unexplained. One of the main objectives of this work is to develop a theoretical understanding of why the version of the cross-entropy function used for ranking in IR succeeds. We also aim to construct a likelihood-based generative model that directly corresponds to this cross-entropy function. Such a model, if successful, would allow us to view IR essentially as a machine learning problem. A secondary objective is to bridge the gap between the generative approaches used in IR and text classification through a unified model. In this work we show that the cross-entropy ranking function corresponds to the log-likelihood of documents w.r.t. the approximate Smoothed-Dirichlet (SD) distribution, a novel variant of the Dirichlet distribution.
We also empirically demonstrate that this new distribution captures term occurrence patterns in documents much better than the multinomial, offering a reason for the superior performance of the cross-entropy ranking function over the multinomial document likelihood. Our experiments in text classification show that a classifier based on the Smoothed Dirichlet performs significantly better than the multinomial-based naïve Bayes model and on par with Support Vector Machines (SVMs), confirming our reasoning. In addition, this classifier is as quick to train as naïve Bayes and several times faster than SVMs owing to its closed-form maximum likelihood solution, making it ideal for many practical IR applications. We also construct a well-motivated generative classifier for IR based on the SD distribution that uses the EM algorithm to learn from pseudo-feedback, and we show that its performance is equivalent to the Relevance Model (RM), a state-of-the-art model for IR in the language modeling framework that uses the same cross entropy as its ranking function. In addition, the SD-based classifier provides more flexibility than RM in modeling documents owing to a consistent generative framework. We demonstrate that this flexibility translates into superior performance compared to RM on the task of topic tracking, an online classification task.
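The ranking function this abstract analyzes can be made concrete. Below is a minimal sketch of cross-entropy ranking in the language-modeling framework, scoring a document by the cross entropy between the query model and a Dirichlet-smoothed document model. The function name, the smoothing parameter `mu=2000`, and the toy data are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter
from math import log

def lm_cross_entropy_score(query_terms, doc_terms, collection_terms, mu=2000):
    """Rank score: negative cross entropy between the query language model
    and a Dirichlet-smoothed document language model (higher is better)."""
    doc_tf, doc_len = Counter(doc_terms), len(doc_terms)
    coll_tf, coll_len = Counter(collection_terms), len(collection_terms)
    q_tf, q_len = Counter(query_terms), len(query_terms)

    score = 0.0
    for w, qc in q_tf.items():
        p_w_q = qc / q_len                    # query model P(w|q)
        p_w_c = coll_tf[w] / coll_len         # background (collection) model
        if p_w_c == 0:
            continue                          # term unseen in collection: skip
        # Dirichlet-smoothed document model P(w|d)
        p_w_d = (doc_tf[w] + mu * p_w_c) / (doc_len + mu)
        score += p_w_q * log(p_w_d)
    return score
```

A document containing the query terms more often receives a higher (less negative) score; the thesis's contribution is showing this score equals a document log-likelihood under the Smoothed-Dirichlet distribution rather than the multinomial.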
12

Retrieval of passages for information reduction

Daniels, Jody J 01 January 1997 (has links)
Information Retrieval (IR) systems typically retrieve entire documents in response to a user's information need. However, users often prefer to examine smaller portions of a document. One example is building a frame-based representation of a text: the user would like to read all and only those portions of the text that are about predefined important features. This research addresses the problem of automatically locating text about these features, where the important features are those defined for use by a case-based reasoning (CBR) system in the form of features and values, or slots and fillers. To locate important text pieces, we gathered a small set of "excerpts" (textual segments) when creating the original case-base representations. Each excerpt contains the local context for a particular feature within a document. We used these excerpts to generate queries that retrieve relevant passages. By locating passages for display to the user, we winnow a text down to sets of several sentences, greatly reducing the time and effort spent searching through each text for important features.
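A minimal sketch of the passage-retrieval idea described above: slide a fixed-size window over a tokenized document and score each window by how many query-term occurrences it contains. The window and step sizes and the function names are hypothetical choices for illustration; the thesis's actual query generation from excerpts is more involved.

```python
from collections import Counter

def best_passages(doc_terms, query_terms, window=25, step=12, k=3):
    """Slide a fixed-size window over the document, score each passage by
    query-term occurrences, and return the top k as
    (score, start_offset, passage) tuples."""
    query = set(query_terms)
    scored = []
    for start in range(0, max(1, len(doc_terms) - window + 1), step):
        passage = doc_terms[start:start + window]
        tf = Counter(passage)
        score = sum(tf[w] for w in query)
        scored.append((score, start, passage))
    # best score first; earlier passages break ties
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]
```

Showing only the top-scoring windows is what reduces a full text to the "sets of several sentences" the abstract mentions.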
13

The identification of differentiating success factors for students in computer science and computer information systems programs of study

Carabetta, James R 01 January 1991 (has links)
Although both are computer-based, computer science and computer information systems programs of study are markedly different. It is therefore not unreasonable to speculate that success factor differences may exist between them, and to seek an objective means of making such a determination based on a student's traits. The purpose of this study was therefore twofold: to determine whether differences do in fact exist between successful computer science majors and successful computer information systems majors, and, if so, to determine a classification rule for such assignment. Based on an aggregate of demographic, pre-college academic, and learning style factors, the groups were found to differ significantly on the following variables (listed in decreasing order of significance, for those with p < .05): sex, abstract conceptualization and concrete-abstract continuum measures, SAT Mathematics, interest ranking for science, active experimentation measure, interest ranking for foreign language, and concrete experience measure. Computer science majors were found to include significantly more males than females, and to have significantly higher abstract conceptualization, concrete-abstract continuum, SAT Mathematics, and interest ranking for science measures than computer information systems majors, while computer information systems majors were found to have significantly higher active experimentation, interest ranking for foreign language, and concrete experience measures. A classification rule, based on a subset of these factors, was derived and found to classify correctly at a 76.6% rate. These results have potential as a research-based component of an advising function for students interested in pursuing a computer science or computer information systems program of study.
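A classification rule assigning students to one of two groups from a vector of measured factors is characteristic of two-group linear discriminant analysis. The abstract does not specify the exact procedure used, so the sketch below is an illustration under that assumption: Fisher's two-group linear discriminant, with a hypothetical function name and toy data.

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Two-group Fisher linear discriminant.
    Returns (w, c): classify x into group 1 if w @ x > c, else group 2."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group covariance matrix
    S = ((len(X1) - 1) * np.cov(X1, rowvar=False) +
         (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(S, m1 - m2)   # discriminant weights S^{-1}(m1 - m2)
    c = w @ (m1 + m2) / 2             # midpoint cutoff between group means
    return w, c
```

With student factor measurements as rows of `X1` (e.g. computer science majors) and `X2` (computer information systems majors), the rule's hit rate can then be checked against held-out students, analogous to the 76.6% correct-classification rate reported.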
14

Optimizing Sample Design for Approximate Query Processing

Rösch, Philipp, Lehner, Wolfgang 30 November 2020 (has links)
The rapid increase of data volumes makes sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatically determining the optimal sample for a given query has remained (almost) unaddressed. To tackle this problem, the authors propose a sample advisor based on a novel cost model. While it is primarily designed for advising samples for a few queries specified by an expert, the authors additionally propose two extensions of the sample advisor. The first extension enhances applicability by utilizing recorded workload information and taking memory bounds into account. The second extension increases effectiveness by merging samples in case of overlapping pieces of sample advice. For both extensions, the authors present exact and heuristic solutions. In their evaluation, the authors analyze the properties of the cost model and demonstrate the effectiveness and efficiency of the heuristic solutions with a variety of experiments.
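The second extension (merging overlapping pieces of sample advice) can be sketched as a greedy merge under a cost model: two pieces of advice with shared columns are combined whenever one merged sample is cheaper than keeping both. The dictionary layout, the `rows * columns` cost stand-in, and the merge criterion below are illustrative assumptions, not the authors' actual cost model.

```python
def merge_advice(advice):
    """Greedily merge pieces of sample advice, each a dict with a set of
    'cols' and an estimated sample size 'rows', whenever the merged sample
    costs less than the two separate samples."""
    def cost(cols, rows):
        return rows * len(cols)  # stand-in: memory footprint of the sample

    merged = [dict(a) for a in advice]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if a["cols"] & b["cols"]:  # overlapping advice
                    u_cols = a["cols"] | b["cols"]
                    u_rows = max(a["rows"], b["rows"])
                    if cost(u_cols, u_rows) < (cost(a["cols"], a["rows"]) +
                                               cost(b["cols"], b["rows"])):
                        merged[i] = {"cols": u_cols, "rows": u_rows}
                        del merged[j]
                        changed = True
                        break
            if changed:
                break
    return merged
```

The paper's exact solutions would search over all merge combinations; a greedy pass like this trades optimality for speed, which is the flavor of the heuristic solutions the evaluation compares against.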
