31. Design and use of XML formats for the FRBR model. Gjerde, Anders, January 2008.
This thesis investigates how XML can be used to design a bibliographical format that stores records better in terms of hierarchical structure and readability. It first presents introductory theory on the techniques that form the foundation of bibliographical formats and on what has previously been in use, and it accounts for the FRBR model, the conceptual framework of the format presented here. Throughout the thesis, several important XML design criteria are presented, with examples of why they matter when constructing a bibliographical format in XML. Different implementation alternatives are presented, and their advantages and disadvantages are thoroughly discussed in order to establish a solid foundation for the choices made. Based on this study, an XSD (XML Schema Definition) has been developed according to the best practices uncovered. The XSD is based on the FRBR model, slightly changed to accommodate the wishes and interests of librarians. Most noteworthy of these changes is that the Manifestation element has been made the top element, with the Expression and Work elements placed hierarchically beneath it in that order. The format maintains a MARC-based datatag structure, so that librarians who are already used to it will not have to readjust to another way of structuring the most common datafields. Relationships and other attributes, however, are handled efficiently in language-based elements, and the XSD accommodates new relationship types with a generic relation element. XSLT has been used to transform an existing XML database to conform to the XSD for testing purposes, and statistics have been collected from the database to support design choices. Depending on the users' needs, there are many possible design choices. XML leads to more readable records but also takes up more space. When XML is used to describe relational metadata, relationships can be expressed through hierarchical storage to a certain degree, but ID/IDREF must be used at some point to avoid infinite inclusion of new records; ID/IDREF may also be used to improve readability or to save storage space. Hierarchical storage leads to many duplicated records, especially for Actors and Concepts. When using XML, the root element of the record structure must be chosen according to which entity is the point of interest; in FRBR, there are several reasons to choose Manifestation as the root element, as it is the focal point of a library record.
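A minimal sketch in Python of the record shape this abstract describes: Manifestation as root, Expression and Work nested beneath it, and a generic relation element pointing at another record by IDREF. All element and attribute names here are illustrative assumptions; the actual vocabulary is defined by the thesis's XSD.

import xml.etree.ElementTree as ET

# Manifestation is the root; Expression and Work are nested beneath it.
manifestation = ET.Element("manifestation", id="m1")

# MARC-style datatag structure kept for the common datafields.
datafield = ET.SubElement(manifestation, "datafield", {"tag": "245"})
datafield.text = "Example title"

expression = ET.SubElement(manifestation, "expression", id="e1")
work = ET.SubElement(expression, "work", id="w1")

# A generic relation element: referencing the related record by IDREF
# instead of nesting it avoids infinite inclusion of new records.
ET.SubElement(work, "relation", type="adaptationOf", idref="w2")

print(ET.tostring(manifestation, encoding="unicode"))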
32. Phrase searching in text indexes. Fellinghaug, Asbjørn Alexander, January 2008.
This master thesis focuses on the challenges of phrase searching in large text indexes, comparing different approaches and assessing a new one in which bigrams are used as index terms. The goal was pursued through an experiment based on the idea of using bigrams containing stopwords as additional index terms. Recognizing the characteristics of inverted index structures, we used stopwords as indicators of severely long posting lists. The characteristics of stopwords proved valuable, and they were collected from an already established index for a subset of the TREC GOV2 collection. As alternative approaches we outline two state-of-the-art index structures specifically designed to cope with the challenges of phrase searching. The first, the nextword index, is a modification of the inverted index structure. The second, the phrase index, uses the inverted structure with complete phrases as index terms. Our bigram index applies the same manipulation of the inverted index structure as the phrase index, using bigrams of words to drastically cut posting list lengths. This was one of our main goals, as we identified the posting list lengths of stopwords as one of the primary challenges of phrase searching in inverted index structures. Using stopwords to create and select bigrams proved successful in enhancing phrase searching, as response times improved substantially. We conclude that our bigram index provides a significant performance increase in query evaluation time and outperforms the standard inverted index for phrase searching.
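As a rough illustration of the bigram idea (not the thesis code), the sketch below adds a bigram posting whenever a word pair contains a stopword, so a phrase query can intersect two short bigram posting lists instead of scanning a stopword's long single-term list. The toy index, the stopword list and the tokenization are all assumptions.

from collections import defaultdict

STOPWORDS = {"the", "of", "in", "a", "to"}

def index_document(doc_id, tokens, index):
    for pos, term in enumerate(tokens):
        index[term].append((doc_id, pos))
        # Add a bigram term whenever the pair contains a stopword, so
        # phrase queries never scan a stopword's full posting list.
        if pos + 1 < len(tokens):
            nxt = tokens[pos + 1]
            if term in STOPWORDS or nxt in STOPWORDS:
                index[term + " " + nxt].append((doc_id, pos))

index = defaultdict(list)
index_document(1, "the tower of london".split(), index)
index_document(2, "london is the capital".split(), index)

# The phrase "tower of london" is answered from two short bigram lists
# rather than the severely long posting list for "of".
print(index["tower of"])   # [(1, 1)]
print(index["of london"])  # [(1, 2)]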
33. A Multimedia Approach to Medical Information Retrieval. Grande, Aleksander, January 2009.
Since the discovery of the structure of DNA by Francis H. C. Crick and James D. Watson in 1953, a great deal of research has been conducted in the field. Over the years, technological breakthroughs have made DNA sequencing faster and more widely available, turning it from a very manual task into a highly automated one. In 1990 the Human Genome Project was started and DNA research skyrocketed. DNA was sequenced faster and faster throughout the 1990s, and more projects aiming to sequence other species' DNA were initiated. All this research produced vast amounts of DNA sequences, but the techniques for searching through them were not developed at the same pace, and the need for new and improved search methods is becoming more and more evident. This thesis explores the possibility of using content-based information retrieval to search through DNA sequences. This is a bold proposition, but it can have great benefits if successfully implemented. By transforming DNA sequences into images, and indexing these images with a content-based information retrieval system, it may be possible to achieve a successful DNA search. We find that this is possible, but further work is needed to resolve some issues discovered in the transformation of DNA sequences into images.
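A minimal sketch of the sequence-to-image transformation, assuming one arbitrary nucleotide-to-color mapping and a fixed image width; the thesis investigates the transformation and its CBIR indexing in far more depth.

from PIL import Image

# Hypothetical mapping; any consistent nucleotide-to-color scheme would do.
COLORS = {"A": (255, 0, 0), "C": (0, 255, 0), "G": (0, 0, 255), "T": (255, 255, 0)}

def sequence_to_image(sequence, width=64):
    height = -(-len(sequence) // width)  # ceiling division
    img = Image.new("RGB", (width, height))
    for i, base in enumerate(sequence):
        img.putpixel((i % width, i // width), COLORS.get(base, (0, 0, 0)))
    return img

# The resulting image can then be indexed by a CBIR system.
sequence_to_image("ACGTACGTTTGACGGA" * 16).save("sequence.png")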
34. Redistribution of Documents across Search Engine Clusters. Høyum, Øystein, January 2009.
The goal of this master thesis has been to evaluate methods for redistributing data on search engine clusters, where redistribution is performed when the cluster changes size. Redistribution methods designed specifically for search engines are not common, so the methods compared in this thesis are drawn from other distributed settings, among them distributed database systems, distributed file systems and continuous media systems. The evaluation consists of two parts: a theoretical analysis and an implementation with testing. In the theoretical analysis the methods are compared by deriving expressions for their performance. In the practical approach the algorithms are implemented on a simplified search engine cluster of six computers. The methods have been evaluated using three criteria. The first criterion is how well the methods distribute documents across the cluster; in the theoretical analysis this also includes worst-case scenarios, while the practical evaluation compares the distribution at the end of the tests. The second criterion is efficiency of document access: the theoretical approach focuses on the number of operations required, while the practical approach calculates indexing throughput. The last area of focus is the volume of documents transported during redistribution. For the final comparison of the methods, some relevant scenarios are introduced, focusing on dynamic data sets with a high frequency of updates, frequent new documents and much searching. Using these scenarios and the results from the method testing, we found some methods that performed better than others. It is worth noting that the conclusions hold for the type of workload given by the scenarios and the setting of the test; in other situations, other methods might be more suitable. For the given scenarios, the best distribution method was the distributed version of linear hashing (LH*), while the method using hashing/range-partitioning proved the least suitable as a consequence of its high transport volume.
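As an illustration of the winning method, here is a sketch of the linear-hashing addressing that underlies LH*, in a simplified single-process form; real LH* spreads buckets across cluster nodes and corrects clients' outdated views of the file state. The parameters and the use of Python's built-in hash() are assumptions.

def lh_bucket(key, level, split_pointer, n0=1):
    # Map a key to a bucket when the file is at the given level and
    # split_pointer buckets have already been split; n0 is the initial
    # bucket count. hash() stands in for a deterministic hash function.
    addr = hash(key) % (n0 * 2 ** level)
    if addr < split_pointer:
        # This bucket has already been split: use the next-level function.
        addr = hash(key) % (n0 * 2 ** (level + 1))
    return addr

# Growing the cluster by one bucket only splits the bucket at the split
# pointer; documents elsewhere stay put, keeping transport volume low.
for doc in ["doc-a", "doc-b", "doc-c"]:
    print(doc, lh_bucket(doc, level=2, split_pointer=1))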
35. Biomedical Information Retrieval based on Document-Level Term Boosting. Johannsson, Dagur Valberg, January 2009.
Information retrieval over biomedical information presents several problems, and common retrieval methods tend to fall short in this domain. With the ever-increasing amount of information available, researchers widely agree that means to retrieve needed information precisely are vital to making use of all available knowledge. In an effort to increase retrieval precision for biomedical information, we have created an approach that gives every term in a document a context weight based on the domain-specific data of its context. We include these context weights in document ranking by combining them with existing ranking models. Combining context weights with existing models gives us document-level term boosting, where the context of the queried terms within a document positively or negatively affects the document's ranking score. We tested the approach by implementing a full search engine prototype and evaluating it on a document collection from the biomedical domain. Our work shows that this type of score boosting has little effect on overall retrieval precision. We conclude that the approach, as implemented in our prototype, is not necessarily a good means of increasing precision in biomedical retrieval systems.
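A minimal sketch of how such document-level term boosting could be combined with an existing ranking model; the base score, the per-document weights and the mean-based combination are assumptions, not the thesis's exact formula.

def boosted_score(base_score, query_terms, context_weights):
    # Scale the document's base score (e.g. from BM25) by the mean
    # context weight of the queried terms; 1.0 is neutral.
    weights = [context_weights.get(t, 1.0) for t in query_terms]
    return base_score * sum(weights) / len(weights)

# Hypothetical context weights for one document: terms used in a strongly
# domain-specific context score above 1.0, off-context terms below it.
doc_weights = {"insulin": 1.4, "receptor": 1.2, "cell": 0.8}
print(boosted_score(12.5, ["insulin", "receptor"], doc_weights))  # boosted
print(boosted_score(12.5, ["cell"], doc_weights))                 # penalized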
36. Personlige samlinger i distribuerte digitale bibliotek (Personal collections in distributed digital libraries). Joki, Sverre Magnus Elvenes, January 2004.
No description available.
37. Integrasjon og bruk av gazetteers og tesauri i digitale bibliotek: Søk og gjenfinning via geografisk referert informasjon (Integration and use of gazetteers and thesauri in digital libraries: search and retrieval via geographically referenced information). Olsen, Marit, January 2004.
No description available.
38. Classification of Images using Color, CBIR Distance Measures and Genetic Programming: An Evolutionary Experiment. Edvardsen, Stian, January 2006.
In this thesis a novel approach to image classification is presented. The thesis explores the use of color feature vectors and CBIR retrieval methods in combination with genetic programming to achieve a classification system that builds classes from training sets and determines whether an image belongs to a specific class. A test bench has been built, with methods for extracting color features from images, both segmented and whole. Three CBIR distance algorithms have been implemented: histogram Euclidean distance, histogram intersection distance and histogram quadratic distance. The genetic program consists of a function set for adjusting weights that correspond to the extracted feature vectors. Fitness of the individual genomes is measured using the CBIR distance algorithms, seeking to minimize the distance between the individual images in the training set. A classification routine is proposed that uses the feature vectors of the image in question, together with the weights generated by the genetic program, to determine whether the image belongs to the trained class. A test set of images is used to determine the accuracy of the method. The results show that it is possible to classify images with this method, but that further exploration is required to achieve good results.
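Of the three distance measures, histogram Euclidean distance is the simplest; below is a sketch assuming joint RGB histograms with 8 bins per channel and two local image files.

import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    # Normalized joint RGB histogram as a flat feature vector.
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def euclidean_distance(h1, h2):
    return np.sqrt(np.sum((h1 - h2) ** 2))

# Smaller distance means more similar color content.
print(euclidean_distance(color_histogram("a.jpg"), color_histogram("b.jpg")))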
39. Supporting SAM: Infrastructure Development for Scalability Assessment of J2EE Systems. Bostad, Geir, January 2006.
The subject of this master thesis is the exploration of scalability for large enterprise systems. The Scalability Assessment Method (SAM) is used to analyse the scalability properties of an Internet banking application built on the J2EE architecture. The report first explains the underlying concepts of SAM. A practical case study is then presented which walks through the stages of applying the method; the focus is to discover, and where possible to supply, the infrastructure necessary to support SAM. The practical results include a script toolbox to automate the measurement process and some investigation of key scalability issues. A further contribution is the detailed guidance contained in the report itself on how to apply the method. Finally, conclusions are drawn with respect to the feasibility of SAM in the context of the case study, and more broadly for similar applications.
40. Identifying Duplicates: Disambiguating Bibsys. Myrhaug, Kristian, January 2007.
The digital information age has brought with it the information seekers. These seekers, ordinary people, are one step ahead of many libraries: they expect all information to be retrievable by posing a query and/or by browsing through information related to their information needs. Disambiguating the creators of publications, that is, identifying and managing ambiguous entries, makes browsing through information related to a specified creator feasible. This thesis proposes a framework, named iDup, for disambiguation of bibliographic information, and evaluates the original edit-distance measure and a specially designed time-frame measure for comparing entries in a collection of BIBSYS-MARC records. The strengths of the time-frame measure and edit distance are both shown, as is the weakness of edit distance.
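A sketch of the classic edit-distance (Levenshtein) computation such a comparison could build on; the pairing with the time-frame measure and the actual handling of BIBSYS-MARC fields in iDup are not shown, and the example names are hypothetical.

def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Near-identical creator names get a small distance and become candidates
# for referring to the same identity.
print(edit_distance("Myrhaug, Kristian", "Myrhaug, K."))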