1.
Assessing relevance using automatically translated documents for cross-language information retrieval. Orengo, Viviane Moreira. January 2004.
This thesis focuses on the Relevance Feedback (RF) process, and the scenario considered is that of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, which will lead to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first relates to the user's ability to assess the relevance of texts in a foreign language, texts hand-translated into their language, and texts automatically translated into their language. The second concerns the relationship between the accuracy of the participants' judgements and the improvement achieved through the RF process. In order to answer those questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated into Portuguese, and documents automatically translated into Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall just moderate, and varies greatly across query topics. This work advances the existing research on RF by considering a CLIR scenario and carrying out user experiments, which analyse aspects of RF and CLIR that had remained unexplored until now. The contributions of this work also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and the carrying out of several experiments using Latent Semantic Indexing which contribute data points to CLIR theory.
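As a rough illustration of the query reformulation step that relevance feedback performs, the sketch below shows a classic Rocchio-style update over term-weight vectors. This is a generic textbook formulation, not necessarily the one used in the thesis; the function name and the weighting parameters are illustrative assumptions.

```python
# Hypothetical sketch of Rocchio-style relevance feedback: the query vector is
# moved towards documents the user judged relevant and away from those judged
# non-relevant. Parameter values are illustrative, not taken from the thesis.
import numpy as np

def rocchio_update(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reformulated query vector from the user's relevance judgements."""
    new_query = alpha * np.asarray(query_vec, dtype=float)
    if relevant:
        new_query += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        new_query -= gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are usually clipped before the query is re-run.
    return np.clip(new_query, 0.0, None)
```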
2.
Linking textual resources to support information discovery. Knoth, Petr. January 2015.
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of. In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery. This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied on a collection of millions of documents to improve access to public knowledge.
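As a point of reference for what automatic link discovery involves, the sketch below proposes links between documents whose TF-IDF representations are similar. It is a simple baseline under assumed parameters (similarity threshold, stop-word list), not the link discovery methods developed in the thesis.

```python
# A minimal similarity-based link discovery baseline: propose a link between two
# documents when the cosine similarity of their TF-IDF vectors exceeds a threshold.
# The threshold and preprocessing choices are assumptions made for illustration.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def discover_links(documents, threshold=0.3):
    """Return (i, j, score) candidate links between documents above the threshold."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
    sims = cosine_similarity(tfidf)
    links = [(i, j, float(sims[i, j]))
             for i, j in combinations(range(len(documents)), 2)
             if sims[i, j] >= threshold]
    return sorted(links, key=lambda link: -link[2])
```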
3.
Self-adapting parallel metric-space search engine for variable query loads. Al Ruqeishi, Khalil. January 2016.
This research focuses on automatically adapting a search engine's size in response to fluctuations in query workload. Deploying a search engine in an Infrastructure as a Service (IaaS) cloud facilitates allocating or deallocating computer resources to or from the engine. Our solution is to contribute an adaptive search engine that will repeatedly re-evaluate its load and, when appropriate, switch over to a different number of active processors. We focus on three aspects and break them out into three sub-problems: Continually determining the Number of Processors (CNP), the New Grouping Problem (NGP) and the Regrouping Order Problem (ROP). CNP is the problem of determining, in the light of changes in the query workload, the ideal number of processors p to be active at any given time. NGP arises once a change in the number of processors has been decided: it must then be determined how the groups of search data will be distributed across the processors. ROP is the problem of how to redistribute this data onto processors while keeping the engine responsive and minimising the switchover time and the incurred network load. We propose solutions for these sub-problems. For NGP we propose an algorithm for incrementally adjusting the index to fit the varying number of virtual machines. For ROP we present an efficient method for redistributing data among processors while keeping the search engine responsive. For CNP, we propose an algorithm that determines the new size of the search engine by re-evaluating its load. We tested the solution's performance using a custom-built prototype search engine deployed in the Amazon EC2 cloud. Our experiments show that, compared with computing the index from scratch, the incremental algorithm speeds up index computation 2 to 10 times while maintaining similar search performance. The chosen redistribution method is 25% to 50% faster than other methods and reduces the network load by around 30%. For CNP we present a deterministic algorithm that shows a good ability to determine a new size for the search engine. When combined, these algorithms give an adaptive algorithm that is able to adjust the search engine size under a variable workload.
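To make the CNP idea concrete, the sketch below shows one simple way a search engine might translate an observed query rate into a target number of processors, with a small dead band to avoid resizing for marginal gains. The parameters and the decision rule are assumptions for illustration; they are not the algorithm proposed in the thesis.

```python
# Illustrative load-driven sizing decision for the CNP sub-problem: choose enough
# processors to keep each one below a target utilisation, and only switch over
# when the change is large enough to justify regrouping. All parameter values
# are assumed for illustration.
import math

def target_processors(queries_per_sec, capacity_per_proc, current_procs,
                      utilisation_target=0.8, min_procs=1, max_procs=64):
    """Return the number of processors the engine should switch to."""
    needed = math.ceil(queries_per_sec / (capacity_per_proc * utilisation_target))
    needed = max(min_procs, min(max_procs, needed))
    # Dead band: ignore a change of a single processor to avoid oscillation.
    if abs(needed - current_procs) <= 1:
        return current_procs
    return needed
```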
4.
Towards Nootropia : a non-linear approach to adaptive document filtering. Nanas, Nikolaos. January 2003.
In recent years, it has become increasingly difficult for users to find relevant information within the accessible glut. Research in Information Filtering (IF) tackles this problem through a tailored representation of the user's interests, a user profile. Traditionally, IF inherits techniques from the related and better-established domains of Information Retrieval and Text Categorisation. These include linear profile representations that exclude term dependencies and may only effectively represent a single topic of interest, and linear learning algorithms that achieve a steady profile adaptation pace. We argue that these practices are not attuned to the dynamic nature of user interests. A user may be interested in more than one topic in parallel, and both frequent variations and occasional radical changes of interest are inevitable over time. With our experimental system "Nootropia", we achieve adaptive document filtering with a single, multi-topic user profile. A hierarchical term network that takes into account topical and lexical correlations between terms and identifies topic-subtopic relations between them is used to represent a user's multiple topics of interest and distinguish between them. A series of non-linear document evaluation functions is then established on the hierarchical network. Experiments using a variation of TREC's routing subtask to test the ability of a single profile to represent two and three topics of interest reveal the approach's superiority over a linear profile representation. Adaptation of this single, multi-topic profile to a variety of changes in the user's interests is achieved through a process of self-organisation that constantly readjusts the profile structurally in response to user feedback. We used virtual users and another variation of TREC's routing subtask to test the profile on two learning and two forgetting tasks. The results clearly indicate the profile's ability to adapt to both frequent variations and radical changes in user interests.
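The following sketch illustrates, in a deliberately simplified form, why a term network yields non-linear document evaluation: a document gains extra credit only when linked terms co-occur, so term dependencies affect the score. The network, weights and scoring rule are hypothetical stand-ins, far simpler than Nootropia's hierarchical profile and its self-organising adaptation.

```python
# Hypothetical non-linear scoring over a weighted term network: the linear part
# rewards individual profile terms, the non-linear part rewards co-occurrence of
# linked terms. A simplified illustration, not Nootropia's actual model.
def score_document(doc_terms, term_weights, term_links):
    """Score a document (a set of terms) against a profile of terms and links.

    term_weights -- {term: weight}
    term_links   -- {(term_a, term_b): link_weight} term dependencies
    """
    score = sum(w for term, w in term_weights.items() if term in doc_terms)
    for (a, b), w in term_links.items():
        if a in doc_terms and b in doc_terms:
            score += w  # credited only when both linked terms appear
    return score

# Example: a profile covering two topics, with one dependency inside each topic.
profile = {"neural": 0.4, "network": 0.3, "wine": 0.5, "vintage": 0.2}
links = {("neural", "network"): 0.6, ("wine", "vintage"): 0.4}
print(score_document({"neural", "network", "training"}, profile, links))  # 0.4 + 0.3 + 0.6
```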
5.
Information visibility on the Web and conceptions of success and failure in Web searching. Mansourian, Yazdan. January 2006.
This thesis reports the procedure and findings of an empirical study of end users' interaction with web-based search tools. The first part addresses the early research questions, discovering web users' conceptions of the invisible web. The second part addresses the primary research questions, exploring web users' conceptualisations of the causes of their search success or failure and their awareness of and reaction to missed information while searching the web. The third part is devoted to a number of emergent research questions, re-examining the dataset in the light of several theoretical frameworks including Locus of Control, Self-efficacy, Attribution Theory, and Bounded Rationality and Satisficing theory. The data collection was carried out in three phases based on in-depth, open-ended and semi-structured interviews with a sample of academic staff, research staff and research students from three biology-related departments at the University of Sheffield. A combination of inductive and deductive approaches was employed to address the three sets of research questions. The first part of the analysis, based on Grounded Theory, led to the discovery of a new concept called 'information visibility', which makes a distinction between the technical, objective conceptions of the invisible web that commonly appear in the literature and a cognitive, subjective conception based on searchers' perceptions of search failure. Accordingly, the study introduced a 'model of information visibility on the web' which suggests a complementary definition of the invisible web. Inductive exploration of the data to address the primary research questions culminated in the identification of different kinds of success (i.e. anticipated, serendipitous and unexpected success) and failure (i.e. unexpected, unexplained and inevitable failure). The results also showed that the participants were aware of the possibility of missing some relevant information in their searches, and that the risk of missing potentially important information is a matter of concern to them. However, depending on the context of each search, they have different perceptions of the importance and volume of the missed information and react to it differently. In view of that, two matrices, the "matrix of search impact" and the "matrix of search depth", were developed to describe users' search behaviour with regard to their awareness of and reaction to missed information. The matrix of search impact suggests that there are different perceptions of the risk of missing information: "inconsequential", "tolerable", "damaging" and "disastrous". The matrix of search depth illustrates different search strategies: "minimalist", "opportunistic", "nervous" and "extensive". The third part of the study indicated that Locus of Control and Attribution Theory are useful theoretical frameworks for helping us better understand web-based information seeking. Furthermore, interpretation of the data with regard to Bounded Rationality and Satisficing theory supported the inductive findings and showed that web users' estimations of the likely volume and importance of missed information affect their decision to persist in searching. At the final stage of the study, an integrative model of information-seeking behaviour on the web was developed. This six-layer model incorporates the results of both the inductive and deductive stages of the study.
6.
Metaphors for organisations during information systems development. Oates, Briony June. January 2000.
How can we enable conventionally-educated information systems (IS) developers to use a richer model of organisations and move towards an interpretive paradigm? The thesis of this research is that a way can be found by using metaphors for organisations as cognitive structuring devices during IS development. Two interpretive, idiographic studies explore whether and how some systems developers could work with a range of metaphors for organisations. Phase 1 of the fieldwork research involved the development of an information system for a small engineering company. Phase 2 involved the development of information systems for a local authority department, a hospital diabetes centre and a chain of DIY stores. The use of metaphors is analysed using a cognitive psychology theory of metaphor and analogy. It is found that the developers used a range of organisational metaphors. They also linked the mappings generated to IS development issues, showing that the metaphors had practical relevance. A prototype methodology, Multi-Metaphor Methodology, is created. Version 1 has a theoretical basis in previous IS research, organisation analysis and cognitive psychology. Learning outcomes from the two fieldwork studies lead to enhancements in Versions 2 and 3. The methodology is thus based on knowledge from both theory and action. The developers felt it was helpful and recommended that its development should continue. Phase 1 used interpretive action research. Issues arising from Phase 1 led to the proposal of three additional validity criteria: the extent of participation, students as co-researchers, and guarding against self or group delusion. Phase 2 used co-operative inquiry. It is concluded that IS action research can be improved by reference to the literature of co-operative inquiry, which better addresses method, participation, knowledge and validity. It is also concluded that co-operative inquiry can be improved by adding validity criteria from the IS action research literature. Contributions to knowledge are made in the use of organisational metaphors and cognitive psychology theory, the development of a methodology, two interpretive studies and an examination of research methods and validity.
7.
Detection of unsolicited web browsing with clustering and statistical analysis. Chwalinski, Pawel. January 2014.
Unsolicited web browsing denotes illegitimately accessing or processing web content. The harmful activity varies from extracting e-mail information to downloading an entire website for duplication. In addition, computer criminals prevent legitimate users from gaining access to websites by implementing a denial of service attack with high-volume legitimate traffic. These offences are accomplished by preprogrammed machines that avoid rate-dependent intrusion detection systems. Therefore, it is assumed in this thesis that the only difference between a legitimate and a malicious web session lies in the intention rather than in physical characteristics or network-layer information. As a result, the main aim of this research has been to provide a method of malicious intention detection. This has been accomplished by a two-fold process. Initially, to discover the most recent and popular transitions of lawful users, a clustering method based on entropy minimisation has been introduced. In principle, by following popular transitions among the web objects, the legitimate users are placed in low-entropy clusters, as opposed to the undesired hosts whose transitions are uncommon and lead to placement in high-entropy clusters. In addition, by comparing distributions of sequences of requests generated by the actual and malicious users across the clusters, it is possible to discover whether or not a website is under attack. Secondly, a set of statistical measurements has been tested to detect the actual intention of browsing hosts. Intention classification based on Bayes factors and likelihood analysis provided the best results. The combined approach has been validated against actual web traces (i.e. datasets), and generated promising results.
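As a concrete illustration of the entropy measure underlying the clustering idea, the sketch below computes the Shannon entropy of the page-to-page transition distribution produced by a group of sessions: sessions that follow popular transitions yield low entropy, while unusual transition patterns yield high entropy. The clustering objective in the thesis is more involved than this single measurement; the function below is an assumed simplification.

```python
# Shannon entropy of the transition distribution over a set of web sessions.
# Low entropy suggests the sessions follow common, popular transitions; high
# entropy suggests uncommon navigation patterns.
import math
from collections import Counter

def transition_entropy(sessions):
    """sessions -- iterable of request sequences, e.g. [["/", "/a", "/b"], ...]"""
    transitions = Counter()
    for seq in sessions:
        transitions.update(zip(seq, seq[1:]))  # consecutive (from, to) pairs
    total = sum(transitions.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in transitions.values())
```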
8.
Developing complex information systems : the use of a geometric data structure to aid the specification of a multi-media information environment. Warman, A. R. January 1990.
The enormous computing power available today has resulted in the acceptance of information technology into a wide range of applications, the identification or creation of numerous problem areas, and the considerable tasks of finding problem solutions. Using computers for handling the current data manipulation tasks which characterise modern information processing requires considerably more sophisticated hardware and software technologies. Yet the development of more 'enhanced' packages frequently requires hundreds of man-years. Similarly, computer hardware design has become so complicated that only by using existing computers is it possible to develop newer machines. The common characteristic of such data manipulation tasks is that much larger amounts of information in evermore complex arrangements are being handled at greater speeds. Instead of being 'concrete' or 'black and white', issues at the higher levels of information processing can appear blurred - there may be much less precision because situations, perspectives and circumstances can vary. Most current packages focus on specific task areas, but the modern information processing environment actually requires a broader range of functions that cooperate in integrating and relating information handling activities in a manner far beyond that normally offered. It would seem that a fresh approach is required to examine all of the constituent problems. This report describes the research work carried out during such a consideration, and details the specification and development of a suggested method for enhancing information systems by specifying a multimedia information environment. This thesis develops a statement of the perceived problems, using extensive references to the current state of information system technologies. Examples are given of how some current systems approach the multiple tasks of processing and sharing data and applications. The discussion then moves to consider further what the underlying objectives of information handling - and a suitable integration architecture - should perhaps be, and shows how some current systems do not really meet these aims, although they incorporate certain of the essential fundamentals that contribute towards more enhanced information handling. The discussion provides the basis for the specification and construction of complete, integrated Information Environment applications. The environments are used to describe not only the jobs which the user wishes to carry out, but also the circumstances under which the job is being performed. The architecture uses a new geometric data structure to facilitate manipulation of the working data, relationships, and the environment description. The manipulation is carried out spatially, and this allows the user to work using a geometric representation of the data components, thus supporting the abstract nature of some information handling tasks.
9.
Concepts of relevance in a semiotic framework applied to ISAD (Information Systems Analysis and Design). Kitiyadisai, Krisana. January 1991.
Relevance is the critical criterion for valuing information. The usual requirements of valuable information resources are their accuracy, brevity, timeliness and rarity. This thesis points out that relevance has to be explicitly recognised as an important quality of information. Therefore, the theory of signs is adopted to enable a systematic study of the problem of relevance according to the branches of semiotics, in order to clarify the concept of information. Relevance has several meanings according to the various disciplinary approaches, including phenomenology, law, logic, information science, communication and cognition. These different concepts are discussed and criticised in two chapters. A new approach is proposed in which a universal concept of relevance is considered as an affordance. Therefore, all the approaches to relevance can be applied within the broader approach of the analysis of affordances. This approach not only encompasses all the underlying characteristics of relevance, it is also compatible with the assumptions of the logic of norms and affordances (NORMA). NORMA semantic analysis is used as a basis on which concepts of relevance are applied semiotically. Two case-studies are selected for testing these concepts, which results in a guideline for practical application in a semiotic framework. The results from these case-studies confirm the practical importance of these concepts of relevance, which can be systematically used in the analysis and design of information systems. They also reaffirm the underlying characteristics of relevance which exist in the context of social reality.
10.
A practical approach to set orientated query execution in semistructured databases. Du, Chu-Ming. January 2003.
The amount of semistructured data is growing rapidly as the World Wide Web has developed into a central means of sharing and disseminating information. The structure of tree-like semistructured data is not rigid. The most common instance of this type of data is XML. Applications endeavouring to access components of semistructured data are naturally inclined towards a recursive approach to navigating the data as trees. However, conventional wisdom indicates that a set-oriented mechanism is necessary for database management systems to obtain good performance in the presence of large amounts of data. Our main objective in this thesis is to develop a set-oriented query execution scheme for XML data. We propose a system called "Equate" (Execution of Queries Using an Automata Theoretic Engine), which uses an automata rewriting scheme to transform queries into an internal query plan with relational-like operators scheduled in a single process for set-oriented execution. Our approach contains two phases. The first phase, set-oriented execution, performs queries on edges and binds any variables required. The second phase, reachability analysis, refines the result, filtering out any false matches, and collects sets of variable bindings into a final result structure. A novel aspect of our approach is that our set-oriented execution, even for complex queries, requires only variants of the relational select, project and union operators, but no joins.
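The sketch below gives a much-simplified picture of the two ideas in combination: edges of the tree are processed a set at a time with select/project-style operations rather than by recursive navigation, and the frontier of reachable nodes is refined step by step. The data model, function names and operators are illustrative stand-ins, not Equate's actual query plans.

```python
# Simplified set-oriented evaluation of a label path over a tree stored as a set
# of (parent, child, label) edges. Each step selects matching edges as a whole
# set and projects onto the reachable children; no per-node recursion is used.
def select_edges(edges, label):
    """Relational-style select: all (parent, child) pairs carrying a given label."""
    return {(p, c) for (p, c, l) in edges if l == label}

def evaluate_path(edges, root, labels):
    """Return the set of nodes reached from root by following the label path."""
    frontier = {root}
    for label in labels:
        step = select_edges(edges, label)
        frontier = {c for (p, c) in step if p in frontier}  # project onto children
    return frontier

# Example: a tiny XML-like tree.
edges = {("doc", "b1", "book"), ("doc", "b2", "book"),
         ("b1", "t1", "title"), ("b2", "t2", "title"), ("b1", "a1", "author")}
print(evaluate_path(edges, "doc", ["book", "title"]))  # {'t1', 't2'}
```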