Improving information retrieval for the biomedical sciences : a case study on mouse knockout literature

Given the growth in accessible online research literature, it is vital that biologists are able to find articles with relevant information to support their investigations. This thesis details research aimed at improving the sensitivity of information retrieval to topics that are contained in articles with a different overall topic. Fetal growth and development (FGD) information contained in articles reporting mouse gene knockout studies was used as the exemplar domain throughout. The research began by assessing the characteristics of the mouse knockout and FGD domains. Two leading domain experts were consulted to develop a definition of FGD information that could be used as the basis for experimentation. Four other domain experts then annotated FGD information in a sample set of 20 articles according to the provided definition, both to verify its presence in the mouse knockout literature and to elucidate topics and words associated with FGD information. Subsequently, numerical and lexical methods were assessed in separate tests for the purpose of detecting FGD topics in the mouse knockout articles. The numerical methods could not be made to differentiate between the various topics, while the lexical methods were impractical to implement for describing the niche FGD domain. Different passage retrieval methods based on the Extended Boolean model were then investigated for their sensitivity to FGD information within the articles. Passages were defined as whole articles, as the titled sections appearing within articles, or as 100-word non-overlapping windows. For each definition, articles were ranked according to their highest-scoring passage in answer to a given query. An experimental methodology was implemented to test each passage retrieval method's relative sensitivity to FGD information based on the judgement of domain experts, quantified using the Discounted Cumulative Gain measure.
The retrieval method that uses titled sections as passages performed best in four of the five assessed queries. An experimental web-based information retrieval system was developed, based around a relational database that stored a custom corpus of 20,715 mouse knockout HTML articles, as defined by the MEDLINE database. An automated heuristic procedure for identifying section passages in HTML articles was developed, which was effective on scientific articles from a large variety of publishers. An examination of the corpus was also undertaken to elicit the domain's characteristics for use in future IR methodologies.
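
The windowed passage definition, the best-passage ranking rule, and the evaluation measure mentioned in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the functions `word_windows`, `passage_score`, `rank_articles`, and `dcg` are hypothetical, the scorer is a simple term-overlap stand-in and not the Extended Boolean model used in the thesis, and the abstract does not say which DCG variant was applied, so one common log2-discounted form is shown.

```python
import math
import re


def word_windows(text, size=100):
    """Split text into non-overlapping windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def passage_score(passage, query_terms):
    """Stand-in scorer: fraction of query terms present in the passage.

    Assumption: this is NOT the Extended Boolean model from the thesis.
    """
    if not query_terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", passage.lower()))
    hits = sum(term.lower() in tokens for term in query_terms)
    return hits / len(query_terms)


def rank_articles(articles, query_terms, size=100):
    """Rank article ids by the score of their best passage, highest first."""
    best = {
        article_id: max(
            (passage_score(p, query_terms) for p in word_windows(text, size)),
            default=0.0,  # an empty article has no passages
        )
        for article_id, text in articles.items()
    }
    # Ties are broken by article id so the ordering is deterministic.
    return sorted(best, key=lambda a: (-best[a], a))


def dcg(relevances):
    """One common DCG form: sum of rel_i / log2(i + 1), with ranks from 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )
```

In use, `rank_articles` would take a mapping from article id to article text plus a list of query terms, and `dcg` would take the expert relevance grades of the returned articles in ranked order. Swapping `word_windows` for a splitter that returns titled sections, or the whole article as a single passage, gives the other two passage definitions.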

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:491483
Date January 2008
Creators Farhan, Reyhood
Publisher University of Manchester
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
