11. Indeksering av heterogene XML dokumenter ved hjelp av datatyper fra XML Schema / Indexing heterogeneous XML Documents using Data Types from XML Schema. Myklebust, Trond Aksel. January 2006.

This master's thesis proposes and investigates a method for performing information retrieval in heterogeneous XML documents by differentiating the indexing process according to data types specified in the associated XML Schema. The goal is to offer information seekers better search capabilities by enabling queries that are independent of element names in a collection of differently structured documents. Today, information search primarily takes place in unstructured documents where the meaning of the content is not directly known. This requires complicated and inaccurate interpretations of the content in order to extract what is what and determine how the documents can best be indexed. An ever-increasing amount of produced information and metadata makes this a demanding process to perform manually. New methods are therefore needed in which the content is described at production time, so that a computer can automatically understand the documents' content. Semi-structured document formats such as XML support the specification of such information and enable differentiated indexing of the content based on annotated information. This makes more detailed queries possible than before, but places new demands on the methods used to index the documents. One of the greatest challenges is locating and interpreting the information that increases the quality of a search result without losing any information. The information does not exist in a flat text file, but contains distinct data types that must be processed individually. This requires new methods that enable indexing based on this information. This thesis presents a proposal for a system that indexes XML documents by interpreting the associated XML Schema, which contains annotations of data type and data format. By using this information for each element, the intention is that indexing is performed by automatically normalizing the element content according to the specified format and data type. Searches can thus be optimized based on data type, regardless of differences in original format and document structure. The system has been tested to find out how well existing XML documents support this type of indexing, and to identify possible ways of doing it better. The outcome of the work, and the main conclusion, is that the proposed method works well as a solution to the problem, provided that the external data used are structured so that data types can be defined for the content.
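
The normalization step lends itself to a small illustration. The sketch below (with invented datatype names and date formats; the thesis reads these annotations from the accompanying XML Schema) shows how element content in different original formats can be indexed under one normalized form:

```python
from datetime import datetime

# Hypothetical source formats for dates; the thesis takes the format
# annotations from the XML Schema rather than a hard-coded list.
DATE_FORMATS = ["%d.%m.%Y", "%Y-%m-%d", "%d %B %Y"]

def normalize_date(value: str) -> str:
    """Normalize a date string in any known source format to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unparseable content untouched

def normalize(value: str, datatype: str) -> str:
    """Dispatch on the schema-declared datatype of the element."""
    if datatype == "date":
        return normalize_date(value)
    if datatype == "integer":
        return str(int(value.replace(" ", "")))
    return value.strip().lower()  # plain text: case-fold only

# Two differently formatted documents index to the same normalized term.
print(normalize("24.12.2006", "date"))  # -> 2006-12-24
print(normalize("2006-12-24", "date"))  # -> 2006-12-24
```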

12. Optimal Information Retrieval Model for Molecular Biology Information. Paulsen, Jon Rune. January 2007.

Search engines for biological information are not a new technology; since the 1960s, computers have emerged as an important tool for biologists. Online Mendelian Inheritance in Man (OMIM) is a comprehensive catalogue containing approximately 14 000 records with information about human genes and genetic disorders. Latent Semantic Indexing (LSI), an approach based on Singular Value Decomposition (SVD), was introduced in 1990; it improved information retrieval and reduced storage requirements. This thesis applies LSI to the collection of OMIM records. To further improve retrieval effectiveness and efficiency, the author proposes a clustering method based on the standard k-means algorithm, called Two-step k-means. Both the standard k-means and the Two-step k-means algorithms are tested and compared with each other.
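
As a rough illustration of the LSI step, the following sketch applies a truncated SVD to a toy term-document matrix and ranks documents against a folded-in query; the OMIM-scale collection and the Two-step k-means clustering are omitted:

```python
import numpy as np

# Toy term-document matrix (terms x documents); real input would be
# weighted term counts over the ~14 000 OMIM records.
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents represented in the k-dimensional latent space.
doc_vectors = (np.diag(s_k) @ Vt_k).T

# Fold a query into the same space: q_k = q U_k S_k^{-1}
q = np.array([1, 1, 0, 0], dtype=float)  # query term vector
q_k = q @ U_k @ np.linalg.inv(np.diag(s_k))

# Rank documents by cosine similarity in the latent space.
sims = doc_vectors @ q_k / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-sims))  # document indices, best match first
```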

13. Finding and Mapping Expertise Automatically Using Corporate Data. Vennesland, Audun. January 2007.

In an organization, management as well as new and experienced employees often need to get in touch with experts in a variety of situations. New staff members need to learn how to perform their jobs; management needs, amongst other things, to staff projects and fill vacancies; and other employees often depend on others' expertise to accomplish their tasks. Traditionally this problem has been approached with computer applications using semi-automatic methods involving self-assessments of expertise stored in databases. These methods prove to be time-consuming, they do not consider the dynamics of expertise, and the self-assessed expertise is often difficult to validate. This report presents an overview of issues involved in expertise finding, and the development of a simple yet effective prototype which tries to overcome the mentioned problems by using a fully automatic approach. A study of the Urban Development area at the Municipality of Trondheim is carried out to analyze the organization's possessed and sought-after expertise, and to collect the information necessary for building the expertise finder prototype. The study found that much expertise evidence resides in the formal correspondence archived in the case handling system's document repository, and that the structure and content of these documents could fit a fully automatic expertise finder well. Four alternative test cases were evaluated during the testing and evaluation of the prototype. One of these test cases, where expert profiles are modelled on the fly based on employees' names occurring in formal documents, is able to compete with, and in some cases outperform, evaluation scores presented in related research.
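
The on-the-fly profile modelling can be pictured with a deliberately simplified sketch (plain term counting stands in for whatever weighting the prototype actually uses; the names and documents below are invented):

```python
from collections import Counter

# Hypothetical corpus: formal documents from a case handling system,
# each tagged with the employee names occurring in it.
documents = [
    {"text": "zoning plan traffic assessment road safety", "names": {"Kari"}},
    {"text": "zoning plan building permit facade review",  "names": {"Ola"}},
    {"text": "traffic model road capacity simulation",     "names": {"Kari"}},
]

def expert_profile(name: str) -> Counter:
    """Model an expert profile on the fly as the term frequencies of
    all documents in which the employee's name occurs."""
    profile = Counter()
    for doc in documents:
        if name in doc["names"]:
            profile.update(doc["text"].split())
    return profile

def rank_experts(query: str, names: list[str]) -> list[tuple[str, int]]:
    """Score each candidate by how often the query terms appear in
    the documents that mention them."""
    results = []
    for name in names:
        profile = expert_profile(name)
        results.append((name, sum(profile[t] for t in query.split())))
    return sorted(results, key=lambda kv: -kv[1])

print(rank_experts("traffic road", ["Kari", "Ola"]))  # Kari ranks first
```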

14. Adaptive personalized eLearning. Takhirov, Naimdjon. January 2008.

This work has found that mapping prior knowledge and learning style is important for constructing personalized learning offerings for students with different levels of knowledge and learning styles. A prior knowledge assessment and a learning style questionnaire were used to assess knowledge level and learning style. The proposed model for automatic construction of prior knowledge assessments aims to connect questions in the assessment to specific course modules in order to identify levels in different modules, because a student may have varying levels of knowledge across modules. We have also found that it is not easy to map students' prior knowledge with total accuracy. However, this is not required in order to achieve a tailored learning experience; an assessment of prior knowledge can still be used to decide what piece of content should be presented to a particular student. Learning style can be simply defined as either the way people learn or an individual's preferred way of learning. The VAK learning style inventory has been found suitable for mapping the learning styles of students, and it is one of the few learning style inventories appropriate for online learning assessment. A questionnaire consisting of 16 questions has been used to identify the learning style of students prior to commencement of the course. It is important to consider the number of questions, because students may feel reluctant to spend too much time on the questionnaire. However, the user evaluation has shown that students willingly answer questions to allow the system to identify their learning styles. This work also presents a comprehensive overview of the state of the art pertaining to learning, learning styles, Learning Management Systems, technologies related to web-based personalization, and related standards and specifications. A brief comparison is also made of various schools that have tried to address personalization of content for web-based learning. Finally, for evaluation purposes, a course on "Designing Relational Databases" was created, and a group of fourteen users evaluated the personalized course.
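
A minimal sketch of how a 16-question VAK questionnaire might be scored (the scoring rule and answer encoding are assumptions for illustration, not the thesis's actual instrument):

```python
from collections import Counter

def vak_style(answers: list[str]) -> str:
    """Assumed scoring rule: each answer favours one modality
    (V, A or K); the most frequent modality wins."""
    style, _ = Counter(answers).most_common(1)[0]
    return {"V": "visual", "A": "auditory", "K": "kinesthetic"}[style]

answers = ["V", "A", "V", "K", "V", "V", "A", "K",
           "V", "V", "A", "V", "K", "V", "A", "V"]  # 16 answers
print(vak_style(answers))  # -> visual
```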

15. Ranking and clustering of search results: Analysis of Similarity graph. Shevchuk, Ksenia Alexander. January 2008.

This thesis evaluates the clustering of the similarity matrix and confirms that it is high. It compares the ranking results of eigenvector ranking and Link Popularity ranking, and confirms that for a highly clustered graph the correlation between the two is larger than for a graph with low clustering.
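
The comparison can be illustrated on a toy similarity graph: eigenvector ranking via power iteration, link popularity as weighted degree, and a Pearson coefficient between the two rankings (a sketch only, not the thesis's actual test setup):

```python
import numpy as np

# Symmetric similarity graph as a weight matrix (toy data).
W = np.array([[0, 3, 2, 0],
              [3, 0, 1, 0],
              [2, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Eigenvector ranking: principal eigenvector of W by power iteration.
r = np.ones(len(W))
for _ in range(100):
    r = W @ r
    r /= np.linalg.norm(r)

# Link popularity ranking: weighted degree of each node.
degree = W.sum(axis=1)

# Pearson correlation between the two rankings.
corr = np.corrcoef(r, degree)[0, 1]
print(r, degree, corr)
```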

16. Design and use of XML formats for the FRBR model. Gjerde, Anders. January 2008.

This thesis investigates how XML can be used to design a bibliographical format for record storage that is better in terms of hierarchical structure and readability. It first presents introductory theory on the techniques that form the foundation of bibliographical formats and on what has previously been in use. It also accounts for the FRBR model, which is the conceptual framework of the format presented here. Throughout the thesis, several important XML design criteria are presented, with examples of why they are important to consider when constructing a bibliographical format with XML. Different implementation alternatives are presented, and their advantages and disadvantages thoroughly discussed, in order to establish a solid foundation for the choices that have been made. Following this study, an XSD (XML Schema Definition) has been made according to the best practices that were uncovered. The XSD is based on the FRBR model, although slightly changed to accommodate the wishes and interests of librarians. Most noteworthy of these changes is that the Manifestation element has been made the top element, with the Expression and Work elements placed hierarchically beneath Manifestation in that order. It maintains a MARC-based datatag structure, so that librarians who are already used to it will not have to readjust to another way of structuring the most common datafields. Relationships and other attributes, however, are handled efficiently in language-based elements, and the XSD accommodates new relationship types with a generic relation element. XSLT has been used to transform an existing XML database to conform to the XSD for testing purposes. Statistics have been collected from the database to support design choices. Depending on the users' needs, there are many different design choices. XML leads to more readable records but also takes up much space. When using XML to describe relational metadata, relationships can be expressed through hierarchical storage to a certain degree, but ID/IDREF must be used at some point to avoid infinite inclusion of new records. ID/IDREF may also be used to improve readability or save storage space. Hierarchical storage leads to many duplicated records, especially concerning Actors and Concepts. When using XML, one must choose the root element of the record structure according to which entity is the point of interest. In FRBR, there are several reasons to choose Manifestation as the root element, as it is the focal point of a library record.
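
The Manifestation-rooted hierarchy and the generic relation element resolved through ID/IDREF can be sketched as follows (the element and attribute names are invented for illustration; the actual XSD uses MARC-based datatags for the common datafields):

```python
import xml.etree.ElementTree as ET

# Manifestation as root, with Expression and Work nested beneath it.
manifestation = ET.Element("manifestation", id="m1")
expression = ET.SubElement(manifestation, "expression", id="e1")
work = ET.SubElement(expression, "work", id="w1")
ET.SubElement(work, "title").text = "Sult"

# A generic relation element pointing to another record via IDREF
# avoids infinitely nested inclusion of related records.
ET.SubElement(work, "relation", type="createdBy", ref="actor42")

print(ET.tostring(manifestation, encoding="unicode"))
```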

17. Phrase searching in text indexes. Fellinghaug, Asbjørn Alexander. January 2008.

This master's thesis focuses on the challenges of phrase searching in large text indexes, comparing different approaches and assessing a new approach in which bigrams are used as index terms. The goal was achieved by performing an experiment based on the idea of using bigrams containing stopwords as additional index terms. Recognizing the characteristics of inverted index structures, we used stopwords as indicators of severely long posting lists. The characteristics of stopwords proved valuable, and they were collected from an already established index for a subset of the TREC GOV2 collection. Among alternative approaches, we outlined two state-of-the-art index structures specifically designed to cope with the challenges of phrase searching. The first structure, the nextword index, follows a modification of the inverted index structure. The second structure, the phrase index, uses the inverted structure with complete phrases as index terms. Our bigram index applies the same manipulation of the inverted index structure as the phrase index, using bigrams of words to drastically cut posting list lengths. This was one of our main goals, as we identified stopword posting list lengths as one of the primary challenges of phrase searching in inverted index structures. Using stopwords to create and select bigrams proved successful in enhancing phrase searching, as response times improved substantially. We conclude that our bigram index provides a significant performance increase in terms of query evaluation time, and outperforms the standard inverted index for phrase searching.
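
The bigram selection rule can be sketched as follows, assuming a small fixed stopword list (the thesis derives its stopwords from the posting-list statistics of the TREC GOV2 subset):

```python
STOPWORDS = {"the", "of", "in", "a", "to"}

def index_terms(text: str):
    """Yield ordinary word terms plus bigrams that contain a stopword.
    The bigrams sidestep long stopword posting lists: a phrase query
    like "the matrix" probes one short bigram list instead of
    intersecting the huge posting list of "the" with that of "matrix"."""
    words = text.lower().split()
    yield from words
    for w1, w2 in zip(words, words[1:]):
        if w1 in STOPWORDS or w2 in STOPWORDS:
            yield f"{w1}_{w2}"

print(list(index_terms("the rise of the machines")))
```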

18. A Multimedia Approach to Medical Information Retrieval. Grande, Aleksander. January 2009.

Since the discovery of the structure of DNA by Francis H. C. Crick and James D. Watson in 1953, a great deal of research has been conducted in the field of DNA. Over the years, technological breakthroughs have made DNA sequencing faster and more available, and it has gone from being a very manual task to being highly automated. In 1990 the Human Genome Project was started, and DNA research skyrocketed. DNA was sequenced faster and faster throughout the 1990s, and more projects with the goal of sequencing other species' DNA were initiated. All this research led to vast amounts of DNA sequences, but the techniques for searching through these sequences were not developed at the same pace. The need for new and improved methods of searching in DNA is becoming more and more evident. This thesis explores the possibilities of using content-based information retrieval to search through DNA sequences. This is a bold proposition, but it can have great benefits if successfully implemented. By transforming DNA sequences into images, and indexing these images with a content-based information retrieval system, it may be possible to achieve a successful DNA search. We find that this is possible, but further work has to be done to solve some discovered issues with the transformation of DNA sequences into images.
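
The transformation idea can be pictured with a minimal sketch; the base-to-pixel mapping below is an invented placeholder, since the thesis leaves the exact transformation open:

```python
import numpy as np

# Hypothetical base-to-grayscale mapping, purely for illustration of
# rendering a sequence as an image a CBIR system could index.
PIXEL = {"A": 0, "C": 85, "G": 170, "T": 255}

def sequence_to_image(seq: str, width: int = 8) -> np.ndarray:
    """Map each nucleotide to a pixel value and reshape into rows."""
    values = [PIXEL[base] for base in seq]
    values += [0] * (-len(values) % width)  # pad the last row
    return np.array(values, dtype=np.uint8).reshape(-1, width)

img = sequence_to_image("ACGTACGTACGTTTGGA")
print(img.shape, img.dtype)  # (3, 8) uint8, ready to save as an image
```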

19. Redistribution of Documents across Search Engine Clusters. Høyum, Øystein. January 2009.

The goal of this master's thesis has been to evaluate methods for redistribution of data on search engine clusters. For all of the methods, redistribution is done when the cluster changes size. Redistribution methods specifically designed for search engines are not common, so the methods compared in this thesis are drawn from other distributed settings, including distributed database systems, distributed file systems, and continuous media systems. The evaluation of the methods consists of two parts: a theoretical analysis, and an implementation and testing of the methods. In the theoretical analysis the methods are compared by deriving expressions for their performance. In the practical approach the algorithms are implemented on a simplified search engine cluster of six computers. The methods have been evaluated using three criteria. The first criterion is how well the methods distribute documents across the cluster; the theoretical analysis also covers worst-case scenarios, while the practical evaluation compares the distribution at the end of the tests. The second criterion is efficiency of document access: the theoretical approach focuses on the number of operations required, while the practical approach calculates indexing throughput. The last area of focus is the volume of documents transported during redistribution. For the final part of the comparison, some relevant scenarios are introduced. These scenarios focus on dynamic data sets with a high frequency of updates, frequent new documents, and much searching. Using the scenarios and the results from the method testing, we found that some methods performed better than others. It is worth noting that the conclusions hold for the type of workload given by the scenarios and the setting of the tests; in other situations, other methods might be more suitable. In conclusion we found that, for the given scenarios, the best distribution method was the distributed version of linear hashing (LH*), while the method using hashing/range-partitioning proved the least suitable as a consequence of its high transport volume.
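
The core mechanism behind the winning method, splitting one bucket at a time so that growing the cluster rehashes only a fraction of the documents, can be sketched in a single-process toy version of linear hashing (the real LH* distributes buckets across servers and handles clients' stale addressing, both omitted here):

```python
class LinearHashCluster:
    """Toy linear hashing (LH*-style) placement of documents."""

    def __init__(self, initial_buckets: int = 2):
        self.n0 = initial_buckets   # buckets at the start of a level
        self.level = 0              # doubling level
        self.split = 0              # next bucket to split
        self.buckets = [[] for _ in range(initial_buckets)]

    def _addr(self, key: int) -> int:
        addr = key % (self.n0 * 2 ** self.level)
        if addr < self.split:       # bucket already split this level
            addr = key % (self.n0 * 2 ** (self.level + 1))
        return addr

    def insert(self, key: int):
        self.buckets[self._addr(key)].append(key)

    def grow(self):
        """Add one node: split the next bucket, rehash only its keys."""
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level, self.split = self.level + 1, 0
        for key in old:
            self.buckets[self._addr(key)].append(key)

cluster = LinearHashCluster()
for doc_id in range(20):
    cluster.insert(doc_id)
cluster.grow()                      # cluster grows from 2 to 3 nodes
print([len(b) for b in cluster.buckets])
```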

20. Biomedical Information Retrieval based on Document-Level Term Boosting. Johannsson, Dagur Valberg. January 2009.

There are several problems regarding retrieval of biomedical information, and common information retrieval methods tend to fall short when searching in this domain. With the ever-increasing amount of information available, researchers widely agree that means of precisely retrieving needed information are vital to making use of all available knowledge. In an effort to increase the precision of biomedical information retrieval, we have created an approach that gives every term in a document a context weight based on the context's domain-specific data. We include these context weights in document ranking by combining them with existing ranking models. Combining context weights with existing models gives us document-level term boosting, where the context of the queried terms within a document positively or negatively affects the document's ranking score. We tested our approach by implementing a full search engine prototype and evaluating it on a document collection within the biomedical domain. Our work shows that this type of score boosting has little effect on overall retrieval precision. We conclude that the approach we have created, as implemented in our prototype, is not necessarily a good means of increasing precision in biomedical retrieval systems.
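
The combination of a base ranking score with a per-document context weight can be pictured with a minimal sketch (the derivation of the weight from domain-specific context data is the thesis's own contribution and is not reproduced here; the multiplicative combination is an assumption):

```python
import math

def score(tf: int, df: int, n_docs: int, context_weight: float) -> float:
    """A tf-idf-style base score scaled by a per-document context
    weight for the queried term, boosting or penalizing the document."""
    base = tf * math.log(n_docs / df)
    return base * context_weight

# Same term statistics, different document contexts:
print(score(tf=3, df=50, n_docs=10_000, context_weight=1.4))  # boosted
print(score(tf=3, df=50, n_docs=10_000, context_weight=0.6))  # penalized
```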