1. Statistical Source Expansion for Question Answering. Schlaefer, Nico. 01 January 2011.
A source expansion algorithm automatically extends a given text corpus with related information from large, unstructured sources. While the expanded corpus is not intended for human consumption, it can be leveraged in question answering (QA) and other information retrieval or extraction tasks to find more relevant knowledge and to gather additional evidence for evaluating hypotheses. In this thesis, we propose a novel algorithm that expands a collection of seed documents by (1) retrieving related content from the Web or other large external sources, (2) extracting self-contained text nuggets from the related content, (3) estimating the relevance of the text nuggets with regard to the topics of the seed documents using a statistical model, and (4) compiling new pseudo-documents from nuggets that are relevant and complement existing information. In an intrinsic evaluation on a dataset comprising 1,500 hand-labeled web pages, the most effective statistical relevance model ranked text nuggets by relevance with 81% MAP, compared to 43% when relying on rankings generated by a web search engine, and 75% when using a multi-document summarization algorithm. These differences are statistically significant and result in noticeable gains in search performance in a task-based evaluation on QA datasets.

The statistical models use a comprehensive set of features to predict the topicality and quality of text nuggets based on topic models built from seed content, search engine rankings and surface characteristics of the retrieved text. Linear models that evaluate text nuggets individually are compared to a sequential model that estimates their relevance given the surrounding nuggets. The sequential model leverages features derived from text segmentation algorithms to dynamically predict transitions between relevant and irrelevant passages. It slightly outperforms the best linear model while using fewer parameters and requiring less training time. In addition, we demonstrate that active learning reduces the amount of labeled data required to fit a relevance model by two orders of magnitude with little loss in ranking performance. This facilitates the adaptation of the source expansion algorithm to new knowledge domains and applications.

Applied to the QA task, the proposed method yields consistent and statistically significant performance gains across different datasets, seed corpora and retrieval strategies. We evaluated the impact of source expansion on search performance and end-to-end accuracy using Watson and the OpenEphyra QA system, and datasets comprising over 6,500 questions from the Jeopardy! quiz show and TREC evaluations. By expanding various seed corpora with web search results, we were able to improve the QA accuracy of Watson from 66% to 71% on regular Jeopardy! questions, from 45% to 51% on Final Jeopardy! questions and from 59% to 64% on TREC factoid questions. We also show that the source expansion approach can be adapted to extract relevant content from locally stored sources without requiring a search engine, and that this method yields similar performance gains. When combined with the approach that uses web search results, Watson's accuracy further increases to 72% on regular Jeopardy! data, 54% on Final Jeopardy! and 67% on TREC questions.
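The four-step pipeline can be sketched compactly. The following Python is a toy illustration only: retrieve_related, extract_nuggets, and the overlap-based relevance score are invented stand-ins, whereas the thesis fits statistical models over topic-model, search-rank, and surface features.

```python
import re

def retrieve_related(topic):
    # Step 1 stand-in: the thesis retrieves web pages related to the seed
    # topic via a search engine; here we return canned example pages.
    return [
        "Mount Everest is the highest mountain on Earth. Click here to subscribe.",
        "The peak of Mount Everest reaches 8,849 metres above sea level.",
    ]

def extract_nuggets(page):
    # Step 2 stand-in: split a page into self-contained sentence-level nuggets.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", page) if s.strip()]

def relevance(nugget, topic_words):
    # Step 3 stand-in: a toy topical-overlap score; the thesis instead trains
    # a statistical relevance model over a much richer feature set.
    words = set(re.findall(r"\w+", nugget.lower()))
    return len(words & topic_words) / max(len(words), 1)

def expand(seed_text, topic, threshold=0.2):
    # Step 4: compile a pseudo-document from nuggets that are both relevant
    # and complementary (i.e., not wholly redundant with what was kept so far).
    topic_words = set(re.findall(r"\w+", seed_text.lower()))
    nuggets = [n for page in retrieve_related(topic) for n in extract_nuggets(page)]
    kept, seen = [], set()
    for n in sorted(nuggets, key=lambda n: relevance(n, topic_words), reverse=True):
        words = set(re.findall(r"\w+", n.lower()))
        if relevance(n, topic_words) >= threshold and not words <= seen:
            kept.append(n)
            seen |= words
    return " ".join(kept)

print(expand("Mount Everest is a mountain in the Himalayas.", "Mount Everest"))
```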
2. Modelling intelligent agents for web-based information gathering. Li, Yuefeng. January 2000.
The recent emergence of intelligent agent technology and advances in information gathering have been important steps forward in efficiently managing and using the vast amount of information now available on the Web to make informed decisions. There are, however, still many problems to be overcome in information gathering research to enable the delivery of the relevant information required by end users.
Good decisions cannot be made without sufficient, timely, and correct information. Traditionally it is said that knowledge is power; nowadays, however, sufficient, timely, and correct information is power. Gathering relevant information to meet user information needs is therefore a crucial step in making good decisions.
The ideal goal of information gathering is to obtain only the information that users need (no more and no less). However, the volume of available information, the diversity of its formats, its inherent uncertainties, and its distributed locations (e.g. the World Wide Web) hinder the process of gathering the right information to meet user needs. Specifically, two fundamental issues regarding the efficiency of information gathering are mismatch and overload. Mismatch means that some information that meets user needs has not been gathered (it is missed), whereas overload means that some of the gathered information is not what users need.
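Assuming mismatch and overload mirror recall and precision failures respectively, the two measures can be made concrete in a few lines (the document sets are illustrative):

```python
needed   = {"d1", "d2", "d3", "d4"}   # documents the user actually needs
gathered = {"d2", "d3", "d5"}         # documents the system gathered

mismatch = len(needed - gathered) / len(needed)    # needed but missed: 1 - recall
overload = len(gathered - needed) / len(gathered)  # gathered but not needed: 1 - precision
print(f"mismatch={mismatch:.2f}, overload={overload:.2f}")  # mismatch=0.50, overload=0.33
```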
Traditional information retrieval has developed considerably over the past twenty years, and the introduction of the Web has changed people's perceptions of it. The task of information retrieval is usually considered to be that of leading the user to those documents that are relevant to his or her information needs. A related function is to filter out irrelevant documents (known as information filtering). Research into traditional information retrieval has produced many retrieval models and techniques for representing documents and queries. Nowadays, however, information is becoming highly distributed and increasingly difficult to gather, and user information needs have been found to contain many uncertainties. These factors motivate research into agent-based information gathering.
Agent-based information systems have emerged in response. In such systems, intelligent agents accept commitments from their users and act on the users' behalf to gather the required information. They can retrieve relevant information from highly distributed, uncertain environments thanks to their intelligence, autonomy, and distribution. Current research on agent-based information gathering systems divides into single-agent gathering systems and multi-agent gathering systems. In both areas, open problems remain to be solved before agent-based information gathering systems can retrieve uncertain information effectively from highly distributed environments.
The aim of this thesis is to develop a theoretical framework for intelligent agents to gather information from the Web. This research integrates the areas of information retrieval and intelligent agents. The specific contributions are the development of an information filtering model for single-agent systems and the development of a dynamic belief model for information fusion in multi-agent systems. The research results are also supported by the construction of real information gathering agents (e.g., a Job Agent) for the Internet that help users gather useful information stored on Web sites. In this framework, information gathering agents are able to describe (or learn) the user's information needs and act like users to retrieve, filter, and/or fuse information.
A rough set based information filtering model is developed to address the problem of overload. The new approach allows users to describe their information needs over user concept spaces rather than over document spaces, and it views a user information need as a rough set over the document space. Rough set decision theory is used to classify new documents into three regions: a positive region, a boundary region, and a negative region. Two experiments are presented to verify this model, and they show that the rough set based model provides an efficient approach to the overload problem.
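A minimal sketch of the three-region classification, assuming a decision-theoretic formulation in which an estimated relevance value is compared against two thresholds; the thesis's actual construction over user concept spaces is not reproduced here.

```python
def three_way_classify(p_relevant, alpha=0.7, beta=0.3):
    # alpha > beta partition the document space into three regions.
    if p_relevant >= alpha:
        return "positive"   # accept: deliver to the user
    if p_relevant <= beta:
        return "negative"   # reject: filter out (combats overload)
    return "boundary"       # defer: more evidence needed

# p_relevant would be derived from the document's fit to the user's concept
# space; the values below are illustrative.
for p in (0.9, 0.5, 0.1):
    print(p, "->", three_way_classify(p))
```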
In this research, a dynamic belief model for information fusion in multi-agent environments is also developed. The model has polynomial time complexity, and it is proven that the fusion results are belief (mass) functions. Using this model, a collection fusion algorithm for information gathering agents is presented. The difficult case for this research is where collections may be used by more than one agent; the algorithm handles it through cooperation between agents, providing a solution to this problem in distributed information retrieval systems.
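The dynamic belief model itself is not spelled out in the abstract. As a classical point of reference for fusing belief (mass) functions from two agents, here is a minimal implementation of Dempster's rule of combination; the frame ("r" for relevant, "n" for not relevant) and the mass values are invented.

```python
def dempster_combine(m1, m2):
    # Combine two mass functions whose focal elements are frozensets.
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    k = 1.0 - conflict  # normalisation; fails only under total conflict
    return {s: w / k for s, w in combined.items()}

A = frozenset
m1 = {A({"r"}): 0.6, A({"r", "n"}): 0.4}                  # agent 1's beliefs
m2 = {A({"r"}): 0.5, A({"n"}): 0.2, A({"r", "n"}): 0.3}   # agent 2's beliefs
print(dempster_combine(m1, m2))
```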
This thesis presents solutions to theoretical problems in agent-based information gathering systems, including information filtering models, agent belief modelling, and collection fusion. It also presents solutions to some of the technical problems in agent-based information systems, such as document classification, the architecture of agent-based information gathering systems, and decision making in multi-agent environments. Information gathering agents of this kind can gather relevant information from highly distributed, uncertain environments.
3. Learning to Rank with Asymmetric Discordant Penalty (非對稱性加權之排名學習機制). Wang, Rung Sheng. Unknown Date.
With the rapid development of information technology, information has become easier to access and available through more channels than ever, but the sheer volume of data also makes it hard for users to find what they really need; this makes ranking an important problem. Ranking is a central issue in many applications, such as document retrieval, expert finding, and anti-spam. The objective of this thesis is to discover a good ranking function by learning from the ranking orders that humans assign for a specific topic, and to apply it in data mining so that a computer can automatically score data and produce a correct ordering, which aids data searching.
This thesis is divided into two parts. First, we design a new learning-to-rank algorithm named RealRankBoost, which improves on an existing method (RankBoost); we investigate its efficacy through a comparative analysis on the LETOR benchmark. Second, we propose asymmetric weightings for ranking: because items at different rank positions have different probabilities of being examined, incorrect placement of top-ranked items should incur a higher penalty. Incorporating this asymmetric weighting technique further makes our system mimic human ranking strategies.
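The abstract does not give RealRankBoost's exact weighting scheme. One way to make the asymmetric idea concrete is a pairwise discordance loss that weights each misordered pair by the true position of the better item, so mistakes near the top cost more; the function and data below are an illustrative sketch.

```python
def asymmetric_discordance(true_order, scores):
    """true_order: item ids from most to least relevant.
    scores: dict mapping item id to the model's predicted score."""
    loss = 0.0
    for i, better in enumerate(true_order):
        for worse in true_order[i + 1:]:
            if scores[better] <= scores[worse]:  # discordant pair
                loss += 1.0 / (i + 1)            # heavier penalty near the top
    return loss

scores = {"a": 0.2, "b": 0.9, "c": 0.5}   # the model wrongly favours b and c over a
print(asymmetric_discordance(["a", "b", "c"], scores))  # 2.0
```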
4. Sentiment-Driven Topic Analysis of Song Lyrics. Sharma, Govind. 08 1900.
Sentiment Analysis is an area of Computer Science that deals with the impact a document makes on a user. The field is further subdivided into Opinion Mining and Emotion Analysis, the latter of which is the basis for the present work. Work on songs is aimed at building affective interactive applications such as music recommendation engines. Using song lyrics, we are interested in both supervised and unsupervised analyses, each of which has its own pros and cons.
For an unsupervised analysis (clustering), we use a standard probabilistic topic model called Latent Dirichlet Allocation (LDA). It mines topics from songs, where each topic is a probability distribution over the vocabulary of words. Some of the topics appear sentiment-based, motivating us to continue with this approach. We evaluate our clusters using a gold dataset collected from an appropriate website and obtain positive results. This approach is useful in the absence of a labeled dataset.
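As a minimal sketch of this unsupervised step, the following uses scikit-learn's LDA on a toy stand-in for the lyrics corpus; each learned topic is a distribution over the vocabulary, summarised here by its highest-weight words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

lyrics = [  # toy stand-in for the thesis's lyrics dataset
    "love heart tonight kiss love",
    "cry tears alone broken heart",
    "dance party night lights dance",
    "road drive highway night wind",
]
vec = CountVectorizer()
X = vec.fit_transform(lyrics)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]  # indices of the highest-weight words
    print(f"topic {k}:", [vocab[i] for i in top])
```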
In another part of our work, we argue that some supervision is inescapable, in that the topics returned must be analysed manually. Further, we also use explicit supervision in the form of a training dataset from which a classifier learns sentiment-specific classes. This analysis helps reduce dimensionality and improve classification accuracy. We obtain excellent dimensionality reduction using Support Vector Machines (SVM) for feature selection. For re-classification, we use the Naive Bayes Classifier (NBC) and SVM, both of which perform well. We also use Non-negative Matrix Factorization (NMF) for classification, but observe that its results coincide with those of NBC, without exception. This drives us towards establishing a theoretical equivalence between the two.
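A hedged sketch of the supervised pipeline: an L1-regularised linear SVM selects sentiment-bearing features, and a Naive Bayes classifier is then trained on the reduced space. The lyrics, labels, and hyperparameters are illustrative stand-ins, not the thesis's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

lyrics = ["love joy sunshine happy", "tears pain lonely cry",
          "happy dance smile love", "broken sad cry dark"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

model = make_pipeline(
    CountVectorizer(),
    # Sparse L1 weights keep only discriminative words (dimensionality reduction).
    SelectFromModel(LinearSVC(C=1.0, penalty="l1", dual=False)),
    MultinomialNB(),   # re-classification on the reduced feature space
)
model.fit(lyrics, labels)
print(model.predict(["sunshine smile", "lonely dark"]))
```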