671 |
COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY / Hinderer, Eugene Waverly, III 01 January 2019 (has links)
Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as Gene Ontology, are of interest for their role in organizing terminology used to describe—among other concepts—the functions, locations, and processes of genes and gene-products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting into appropriate functional or descriptive categories. Also, relationships that organize the terminology within these ontologies do not contain descriptions of semantic scoping or scaling among terms. Therefore, there exists some ambiguity, which complicates the automation of categorizing terms to improve interpretability.
We emphasize that, because of these ambiguities, existing methods risk producing incorrect mappings to categories unless simplified, incomplete versions of the ontologies that omit the problematic relations are used. Such ambiguities can have a significant impact on term categorization: we have calculated upper-boundary estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. Omitting problematic relationships, however, results in a significant loss of retrievable information; in the Gene Ontology, omitting this single relation alone accounts for a 6% reduction, and the loss would grow drastically if all such relations in an ontology were omitted. To address these issues, we have developed methods that categorize individual ontology terms into broad, biologically related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, while also addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph.
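As a concrete illustration of the mechanics discussed above, the following sketch propagates gene annotations up a toy ontology graph while inheriting only through scope-safe relations (treating has_part as problematic), and scores term enrichment with a hypergeometric test. The toy ontology, relation policy, and counts are illustrative assumptions, not the dissertation's actual software.

```python
# Sketch: relation-aware ancestor propagation plus hypergeometric enrichment.
from scipy.stats import hypergeom

# term -> list of (parent_term, relation); a hypothetical GO fragment
ONTOLOGY = {
    "GO:B": [("GO:A", "is_a")],
    "GO:C": [("GO:A", "has_part")],   # scoping relation: parent *has part* child
    "GO:D": [("GO:B", "part_of")],
}

SAFE_RELATIONS = {"is_a", "part_of"}  # safe to inherit annotations through

def ancestors(term, allowed=SAFE_RELATIONS):
    """All ancestors reachable through the allowed relation types only."""
    found = set()
    stack = [term]
    while stack:
        for parent, rel in ONTOLOGY.get(stack.pop(), []):
            if rel in allowed and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def enrichment_p(study_hits, study_size, pop_hits, pop_size):
    """P(X >= study_hits) under the hypergeometric null."""
    return hypergeom.sf(study_hits - 1, pop_size, pop_hits, study_size)

print(ancestors("GO:D"))  # {'GO:B', 'GO:A'}: inherited through is_a/part_of only
print(ancestors("GO:C"))  # set(): the has_part edge is not followed

# e.g. 8 of 50 differentially expressed genes annotated (directly or through
# safe ancestors) to a term covering 120 of 10000 genes overall:
print(enrichment_p(8, 50, 120, 10000))
```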
We show that, when compared to similar term categorization methods, our method matches hand-curated categorizations with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically relevant terms, and by extension putative gene targets, are identified in our annotation enrichment results but missed by traditional methods. Additionally, we observed a marginal yet consistent improvement in the statistical power of enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.
|
672 |
Da information findability à image findability : aportes da polirrepresentação, recuperação e comportamento de busca / Roa-Martínez, Sandra Milena. January 2019 (has links)
Advisor: Silvana Aparecida Borsetti Gregorio Vidotti / Co-advisor: Juan Antonio Pastor-Sánchez / Committee: Silvana Drumon Monteiro / Committee: Ana Carolina Simionato Arakaki / Committee: Fernando Luiz Vechiato / Committee: José Eduardo Santarém Segundo / Abstract: Technological advances in society have enabled the generation and availability of information on an unprecedented scale, across many contexts, on multiple devices, and in different formats. Before information in digital environments can be accessed and used, it must first be retrieved and found. Information Retrieval has been widely discussed in multiple studies since the origins of Information Science and Computer Science, while Findability has become a focus of study only in recent years. In this context, in order to clarify the relationship between Information Retrieval and Findability, and how these processes apply to digital images (imagery resources of a complex nature, given the layers of content that must be analyzed during representation), this work aims to improve Retrieval and Findability for digital images through the use of polyrepresentation and Semantic Web technologies. Information Science offers the groundwork for a scientific and technological approach to these issues, integrating the different contents and information of imagery resources with the informational needs of the user. The methodology of this research is basic in nature but became applied, qualitative-quantitative, exploratory, and descriptive, with a design based on the quadripolar method, using techniques such as bibliographic survey and document analysi... (Complete abstract: click electronic access below) / Doutor
|
673 |
Computational methods for mining health communications in web 2.0 / Bhattacharya, Sanmitra 01 May 2014 (has links)
Data from social media platforms are being actively mined for trends and patterns of interest. Problems such as sentiment analysis and prediction of election outcomes have become tremendously popular due to the unprecedented availability of social interactivity data of different types. In this thesis we address two problems that have been relatively unexplored. The first problem relates to mining beliefs, in particular health beliefs, and their surveillance using social media. The second problem relates to the investigation of factors associated with engagement with U.S. Federal Health Agencies via Twitter and Facebook.
In addressing the first problem we propose a novel computational framework for belief surveillance. This framework can be used for 1) surveillance of any given belief in the form of a probe, and 2) automatically harvesting health-related probes. We present our estimates of support, opposition and doubt for these probes, some of which represent true information, in the sense that they are supported by scientific evidence, others false information, and the remaining debatable propositions. We show, for example, that the levels of support in false and debatable probes are surprisingly high. We also study the scientific novelty of these probes and find that some of the harvested probes with sparse scientific evidence may indicate novel hypotheses. We also show the suitability of off-the-shelf classifiers for belief surveillance; we find these classifiers are quite generalizable and can be used for classifying newly harvested probes. Finally, we show that probes can be harvested and tracked over time. Although our work is focused on health care, the approach is broadly applicable to other domains as well.
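The "off-the-shelf classifier" idea can be illustrated with a standard TF-IDF and logistic-regression pipeline that assigns support, opposition, or doubt labels to posts about a probe. This is a hedged sketch: the training snippets are invented placeholders, not data from the study.

```python
# Sketch: a generic stance classifier for belief probes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "vitamin C definitely cures my colds every time",
    "there is no evidence vitamin C prevents colds",
    "not sure whether vitamin C actually helps with colds",
]
labels = ["support", "opposition", "doubt"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # word and bigram features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

# Newly harvested probes can then be classified without retraining per probe:
print(clf.predict(["I doubt vitamin C does anything for a cold"]))
```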
For the second problem, our specific goals are to study factors associated with the amount and duration of engagement with these organizations. We use negative binomial hurdle regression models and Cox proportional hazards survival models for these analyses. For Twitter, the hurdle analysis shows that the presence of a user-mention is positively associated with the amount of engagement, while negative sentiment has an inverse association; the content of tweets is equally important for engagement. The survival analyses indicate that engagement duration is positively associated with follower count. For Facebook, both hurdle and survival analyses show that the number of page likes and positive sentiment are correlated with higher and prolonged engagement, while a few content types are negatively correlated with engagement. We also find patterns of engagement that are consistent across Twitter and Facebook.
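A minimal sketch of these two model families is shown below, using statsmodels and lifelines on synthetic data. The covariates are placeholders, and the count stage uses an ordinary negative binomial GLM on the positive counts rather than the zero-truncated variant that a strict hurdle model calls for.

```python
# Sketch: a two-stage hurdle model plus a Cox proportional hazards model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "user_mention": rng.integers(0, 2, n),   # 1 if the tweet mentions a user
    "neg_sentiment": rng.random(n),          # negative-sentiment score
    "replies": rng.poisson(1.2, n),          # engagement count
})

# Hurdle stage 1: does any engagement occur at all? (logistic regression)
X = sm.add_constant(df[["user_mention", "neg_sentiment"]])
any_eng = (df["replies"] > 0).astype(int)
stage1 = sm.Logit(any_eng, X).fit(disp=0)

# Hurdle stage 2: how much engagement, given some occurred? (neg. binomial GLM)
pos = df["replies"] > 0
stage2 = sm.GLM(df.loc[pos, "replies"], X[pos],
                family=sm.families.NegativeBinomial()).fit()

# Survival model for engagement duration (e.g. days until engagement stops).
surv = pd.DataFrame({
    "duration": rng.exponential(30, n),
    "event": rng.integers(0, 2, n),          # 1 = engagement ended (observed)
    "followers": rng.lognormal(8, 1, n),
})
cph = CoxPHFitter().fit(surv, duration_col="duration", event_col="event")

print(stage1.params, stage2.params, cph.params_, sep="\n")
```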
|
674 |
Mining for evidence in enterprise corpora / Almquist, Brian Alan 01 May 2011 (has links)
The primary research aim of this dissertation is to identify the strategies that best meet the information retrieval needs expressed in the "e-discovery" scenario. This task calls for a high-recall system that, in response to a request for all available documents relevant to a legal complaint, effectively prioritizes documents from an enterprise collection in order of their likelihood of relevance. High-recall information retrieval strategies, such as those employed for e-discovery and patent or medical literature searches, incur high costs when relevant documents are missed, but they also carry high document review costs.
Our approaches parallel the evaluation opportunities afforded by the TREC Legal Track. Within the ad hoc framework, we propose an approach that includes query field selection, techniques for mitigating OCR error, term weighting strategies, query language reduction, pseudo-relevance feedback using document metadata and terms extracted from documents, merging result sets, and biasing results to favor documents responsive to lawyer-negotiated queries. We conduct several experiments to identify effective parameters for each of these strategies.
Within the relevance feedback framework, we use an active learning approach informed by signals from previously collected relevance judgments and ranking data. We train a classifier to prioritize the unjudged documents retrieved by different ad hoc information retrieval techniques applied to the same topic. We demonstrate significant improvements over heuristic rank aggregation strategies when choosing from a relatively small pool of documents. With a larger pool of documents, we validate the effectiveness of the merging strategy as a means to increase recall, but find that the sparseness of judgment data prevents effective ranking by the classifier-based ranker.
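For context, the sketch below implements reciprocal rank fusion, a standard heuristic rank-aggregation baseline of the kind such classifier-based rankers are compared against. The runs are invented, and RRF stands in as a representative heuristic rather than the dissertation's exact method.

```python
# Sketch: reciprocal rank fusion (RRF) over several ranked result lists.
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    """Merge ranked result lists; runs = list of [doc_id, ...] in rank order."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] += 1.0 / (k + rank)   # documents ranked high anywhere win
    return sorted(scores, key=scores.get, reverse=True)

run_a = ["d3", "d1", "d7", "d2"]   # e.g. a query-expansion run
run_b = ["d1", "d9", "d3"]         # e.g. a metadata pseudo-relevance-feedback run
print(reciprocal_rank_fusion([run_a, run_b]))
```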
We conclude our research by optimizing the classifier-based ranker and applying it to other high-recall datasets. Our concluding experiments consider the potential benefits of modifying the merged runs using methods derived from social choice models. We find that this technique, Local Kemenization, is hampered by the large number of documents and the small number of result sets contributing to the ranked list. This two-stage approach to high-recall information retrieval tasks continues to offer a rich set of questions for future research.
|
675 |
Adapting Automatic Summarization to New Sources of Information / Ouyang, Jessica Jin January 2019 (has links)
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
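The style-shift intuition behind the content extractor can be rendered as a toy: score each sentence by the cosine distance between its function-word profile and that of the preceding sentence, and extract the largest shifts. The fixed vocabulary and scoring here are our own simplifying assumptions, far cruder than the thesis system.

```python
# Sketch: extract sentences at the largest local shifts in writing style.
import numpy as np

FUNCTION_WORDS = ["i", "the", "and", "was", "then", "but", "so", "of", "to"]

def style_vector(sentence):
    """Unit-normalized function-word counts as a crude style profile."""
    toks = sentence.lower().split()
    v = np.array([toks.count(w) for w in FUNCTION_WORDS], dtype=float)
    return v / (np.linalg.norm(v) or 1.0)

def style_shift_extract(sentences, k=2):
    vecs = [style_vector(s) for s in sentences]
    shifts = [0.0] + [1.0 - float(vecs[i] @ vecs[i - 1])
                      for i in range(1, len(sentences))]
    top = sorted(range(len(sentences)), key=lambda i: shifts[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]   # keep narrative order

story = [
    "So I was walking to the station like I do every day.",
    "Then out of nowhere a fire truck jumped the curb!",
    "Everyone was screaming and running.",
    "Anyway I still made my train somehow.",
]
print(style_shift_extract(story))
```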
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages, documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create, and train on, large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disfluent translations of non-English documents, and it generalizes across languages.
|
676 |
A personalised query expansion approach using context / Seher, Indra, University of Western Sydney, College of Health and Science, School of Computing and Mathematics January 2007 (has links)
Users of the Web usually use search engines to find answers to a variety of questions. Although search engines can rapidly process a large number of Web documents, in many cases the answers they return are not relevant to the user's information need, even though they contain the same keywords as the query. This is because the Web contains information sources created independently by numerous authors whose vocabularies vary greatly, and because most words in natural languages are inherently ambiguous. This vocabulary mismatch between user queries and Web sources is often addressed through query expansion. Moreover, user questions are often short, and search results improve when queries are longer. Various query expansion methods that add useful question-related terms before processing the question have been proposed and shown to improve results; some of these methods add contextual information related to the user and the question.

Human communication, on the other hand, is quite successful and seems very easy. This is mainly due to humans' understanding of language and their world knowledge. Human communication succeeds when there is an implicit understanding of the everyday situations of the others taking part in it; this implicit situational information, or "context", that humans share enables them to have more meaningful interactions. As with human-human communication, improving computers' access to context can increase the richness of human-computer communication, giving more useful computational services to users.

Based on the above factors, this research proposes a method that makes use of context in order to understand and process user requests. Here, the term "context" means the meanings associated with key query terms and the preferences that have to be decided in order to process the query. As in a natural environment, the results produced for the same question could vary between users in an automated system. If the system knows a user's preferences related to the question, it can use them when processing the query, producing more relevant and useful results for that user. Hence, a new approach for personalised query expansion is proposed in this research, in which user queries are expanded with user preferences, so that the expanded query used for processing varies from user to user. An architecture required for such a Web application to carry out personalised query expansion with contextual information is also proposed in the thesis.

The preferences used for the expansion are therefore user-specific. Users have different sets of preferences depending on the tasks they want to perform. Similar tasks that share the same types of preferences can be grouped into task-based domains; user preferences are thus the same within a domain and vary across domains. Furthermore, different types of subtasks can be performed within a domain, and the set of preferences usable for each subtask varies, forming a subset of the domain's preferences. Hence, this research proposes an approach for personalised query expansion that adds user-, domain- and task-specific preferences to user queries. The main stages of this expansion are identified and discussed in this thesis.
Each of these stages requires different contextual information, which is represented in the context model. Of the main stages identified in the query expansion process, the first three, domain identification, task identification, and missing-parameter identification, are explored in the thesis.

As the preferences used for the expansion depend on the query domain, the domain of the query must be identified first. Hence, a domain identification algorithm that makes use of eight different features is proposed in the thesis to identify the domains of given queries. This domain identification also reduces the ambiguity of query terms: once the query domain is identified, the contextual meanings of query terms are known, which limits the scope of possible misinterpretations. A domain ontology, a domain dictionary, and a user profile are used by the domain identification algorithm. The domain ontology consists of objects and their categories, attributes of objects and their categories, relationships among objects, and instances and their categories in the domain. The domain dictionary consists of objects and attributes, and is created automatically from the domain ontology. The user profile holds the user's long-term preferences, both domain-specific and general.

Once the domain of the query is known, the task specified in the query has to be identified in order to decide the user's preferences. This task identification process is found to be similar across domains with similar activities, so domains are grouped at this stage. These domain groups, and the rules that can be used to identify tasks within them, are identified and discussed in the thesis. For each subtask in the domain groups, the types of preferences that could be used to expand user queries are identified and applied; a simplified sketch of these stages follows below.

An experiment is designed to evaluate the performance of the proposed approach. The first three stages of the query expansion, domain identification, task identification, and missing-parameter identification, are implemented and evaluated. Sample sets from five domains are implemented, and queries in these domains are collected from various users. A wizard is provided by the system for creating new domains, and the system also allows editing of existing domains, domain groups, and the types of preferences in the subtasks of the domain groups. Instances of the attributes are manually identified and added to the system through its interface. At each stage of the query expansion, the correct results for the queries are manually identified and compared with the results produced by the system. The results confirm that the proposed method has a positive impact on query expansion. The experiments, results, and evaluation of the proposed approach are also presented in the thesis.

The proposed approach to query expansion could be used by search engines, by organisations with a limited set of task domains, and by any application that can benefit from personalised query expansion. / Doctor of Philosophy (PhD)
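A simplified sketch of the first stages of this pipeline appears below, with hypothetical domain dictionaries and preference profiles standing in for the domain ontology, domain dictionary, and user profile described above.

```python
# Sketch: identify a query's domain, then expand with stored user preferences.
DOMAIN_DICTIONARIES = {
    "travel": {"flight", "hotel", "fare", "destination", "book"},
    "dining": {"restaurant", "menu", "cuisine", "reserve", "table"},
}

USER_PREFERENCES = {  # long-term, domain-specific preferences from the profile
    "travel": {"class": "economy", "airline": "any"},
    "dining": {"cuisine": "vegetarian"},
}

def identify_domain(query):
    """Pick the domain whose dictionary overlaps most with the query terms."""
    terms = set(query.lower().split())
    scores = {d: len(terms & vocab) for d, vocab in DOMAIN_DICTIONARIES.items()}
    return max(scores, key=scores.get)

def expand_query(query):
    domain = identify_domain(query)
    prefs = USER_PREFERENCES.get(domain, {})
    extra = " ".join(f"{k}:{v}" for k, v in prefs.items())
    return f"{query} {extra}".strip()

print(expand_query("book a flight to a warm destination"))
# -> the same query, augmented with this user's travel preferences
```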
|
677 |
The development of enhanced information retrieval strategies in undergraduates through the application of learning theory: an experimental study / Macpherson, Karen, n/a January 2002 (has links)
In this thesis, teaching and learning issues involved in end-user information retrieval from electronic databases are examined. A two-stage model of the information retrieval process, based on information processing theory, is proposed, and a framework for the teaching of information literacy is developed.

The efficacy of cognitive psychology as a theoretical framework for understanding a number of information retrieval issues is discussed. These issues include: teaching strategies that can assist the development of conceptual knowledge of the information retrieval process; individual differences affecting information retrieval performance, particularly problem-solving ability; and expert and novice differences in search performance.

The researcher investigated the impact of concept-based instruction on the development of information retrieval skills through a two-stage experimental study conducted with undergraduate students at the University of Canberra, Australia. Phase 1 was conducted with 254 first-year undergraduates in 1997, with a 40-minute concept-based teaching module as the independent variable. A number of research questions were proposed:

1. Will type of instruction influence the acquisition of knowledge of electronic database searching?
2. Will type of instruction influence information retrieval effectiveness?
3. Are problem-solving ability and information retrieval effectiveness related?
4. Are problem-solving ability and cognitive maturity related?
5. Are there any differences in the search behaviour of more effective and less effective searchers?

Subjects completed a pre-test, which measured knowledge of electronic databases and problem-solving ability, and a post-test, which measured changes in these abilities. Subjects in the experimental treatment were taught the 40-minute concept-based module, which incorporated teaching strategies grounded in learning theory, including the use of analogy, modelling, and the introduction of complexity. The aims of the module were to foster the development of a realistic concept of the information retrieval process, and to provide a problem-solving heuristic to guide subjects in their search strategy formulation. All subjects completed two post-tests: a survey that measured knowledge of search terminology and strategies, and an information retrieval assignment that measured the effectiveness of search design and execution.

Results suggested that a concept-based approach is significantly more effective than a traditional, skills-demonstration approach in the teaching of information retrieval, both in increasing knowledge of the search process and in improving search outcomes. Further, results suggested that search strategy formulation is significantly correlated with electronic database knowledge and problem-solving ability, and that problem-solving ability and level of cognitive maturity may be related. Results supported the two-stage model of the information retrieval process suggested by the researcher as one possible construct of the thinking processes underlying information retrieval.

These findings led to the implementation of Phase 2 of the research in 1999, with 68 second-year undergraduate students at the University of Canberra as subjects. In this phase, concept-based teaching techniques were used to develop four modules covering a range of information literacy skills, including critical thinking, information retrieval strategies, evaluation of sources, and determining the relevance of articles. Results confirmed that subjects taught by methods based on learning theory paradigms (the experimental treatment group) were better able to design effective searches than subjects who did not receive such instruction (the control treatment group). Further, results suggested that these teaching methods encouraged experimental group subjects to locate material from more credible sources than did control group subjects. These findings are of particular significance given the increasing use of the unregulated internet environment as an information source.

Taking into account the literature reviewed and the results of Phases 1 and 2, a model of the information retrieval process is proposed. Finally, recognising the central importance of the acquisition of information literacy to student success at university, and to productive membership of the information society, a detailed framework for the teaching of information literacy in higher education is suggested.
|
678 |
Les logiciels libres au sein des ministères français / Ennifar, Dhakouane 22 October 2007 (has links) (PDF)
This research thesis complements studies pursued at the ESCP-EAP school in Paris, as part of a specialised master's degree in strategy and management of information systems, which itself followed engineering training as an expert in computer systems and networks at the École Supérieure de Génie Informatique in Paris. The document is divided into three parts, through which an analysis of the integration of free and open-source software within the French ministries and their administrations is presented. The first part gives an overview of the free-software concept, the free-software world, and the software that composes it; its purpose is to provide a better understanding of the open-source domain. It presents a brief history, a definition of the term and the characteristics of free software, the main stakes raised by this technology, and its legal framework, all aspects that constitute the environment in which the ministries' move toward free software takes place. The second part gathers the results of a survey conducted among various French ministries, their administrations, and the heads of their respective information systems directorates. It sets out the main migrations to free software within these ministries, before characterising more precisely the choice of this conversion, its benefits, and its possible difficulties. The final part of the document examines the consequences of the change. Inevitably, antagonistic reactions appear, but they do not prevent genuine reflection on the future of free software within the ministries. In any event, the ministries appear to be a springboard for free software in other sectors.
|
680 |
Similarity Search in High-dimensional Spaces with Applications to Time Series Data Mining and Information Retrieval / Muhammad Fuad, Muhammad Marwan 22 February 2011 (has links) (PDF)
We address the similarity search problem, one of the main problems in information retrieval and data mining, primarily from a metric perspective. We focus on time series data, but our general objective is to develop methods and algorithms that can be extended to other data types. We investigate new methods for handling the similarity search problem in high-dimensional spaces. The novel methods and algorithms we introduce are tested extensively and shown to outperform existing methods and algorithms in the literature.
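To illustrate the metric perspective, the sketch below applies the triangle-inequality lower bound |d(q, p) - d(p, x)| <= d(q, x) with a single pivot to prune candidates in a range query over toy time series. The pivot choice, radius, and Euclidean metric are illustrative assumptions rather than the thesis's specific algorithms.

```python
# Sketch: pivot-based pruning for metric range queries over time series.
import numpy as np

def euclid(a, b):
    return float(np.linalg.norm(a - b))

def range_query(query, series, pivot, radius):
    d_qp = euclid(query, pivot)
    pivot_dists = [euclid(pivot, s) for s in series]   # precomputable offline
    hits, computed = [], 0
    for s, d_px in zip(series, pivot_dists):
        if abs(d_qp - d_px) > radius:    # triangle-inequality lower bound
            continue                     # pruned: cannot be within the radius
        computed += 1                    # only now pay for a full distance
        if euclid(query, s) <= radius:
            hits.append(s)
    return hits, computed

rng = np.random.default_rng(1)
data = [rng.standard_normal(64).cumsum() for _ in range(1000)]  # random walks
q, p = data[0] + 0.1, data[42]
hits, computed = range_query(q, data, p, radius=2.0)
print(len(hits), "hits;", computed, "full distance computations out of", len(data))
```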
|