111 |
Feature Selection for Text CategorisationGarnes, Øystein Løhre January 2009 (has links)
Text categorization is the task of discovering the category or class text documents belongs to, or in other words spotting the correct topic for text documents. While there today exists many machine learning schemes for building automatic classifiers, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme, and then ranked by these weights. Next, each document is represented using only the top ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and might even improve accuracy. In situations where the documents can belong to one of a series of categories, one can either build a multi-class classifier and use one feature set for all categories, or one can split the problem into a series of binary categorization tasks (deciding if documents belong to a category or not) and create one ranked feature subset for each category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of a manually pre-categorized set of training documents, and unsupervised methods that need only training documents of the same type or collection that is to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments. Also, several methods have been proposed the last two years. Moreover, most evaluations are conducted on a set of binary tasks instead of a multi-class task as this often gives better results, although multi-class categorization with a joint feature set often is used in operational environments. In this report, we present results from the comparison of 16 feature selection methods (in addition to random selection) using various feature set sizes. Of these, 5 were unsupervised , and 11 were supervised. All methods are tested on both a Naive Bayes (NB) classifier and a Support Vector Machine (SVM) classifier. We conducted multi-class experiments using a collection with 20 non-overlapping categories, and each feature selection method produced feature sets common for all the categories. We also combined feature selection methods and evaluated their joint efforts. We found that the classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance for NB, but not for SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency all showed performances close to the best group. The Bi-Normal Separation metric produced excellent results for the smallest feature subsets. The weirdness factor performed several times better than random selection, but was not among the top performing group. Some combination experiments achieved better results than each method alone, but the majority did not. The top performers Chi square and GSS coefficient classified more documents when used together than alone.Four of the five combinations that showed increase in performance included the BNS metric.
|
112 |
Semantic Cache Investment : Adaption of Cache Investment for DASCOSABeiske, Konrad Giæver, Bjørndalen, Jan January 2009 (has links)
Semantic cache and distribution introduce new obstacles to how we use cache in query processing in databases. We have adapted a caching strategy called cache investment to work in a peer-to-peer database with semantic cache. Cache investment is a technique that influences the query optimizer without changing it. It suggests cache candidates based on knowledge about queries executed in the past. These queries are not only limited to the local site, but also detects locality in queries by looking at queries processed on remote sites. Our implementation of Semantic cache investment for distributed databases shows a great performance improvement, especially when multiple queries are active at the same time. To utilize cache investment we have looked into how a distributed query optimizer can be extended to use cache content in planning. This allows the query optimizer to detect and include beneficial cache content on remote sites that it otherwise would have ignored. Our implementation of a cache-aware optimizer shows an improvement in performance, but its most important task is to evaluate cache candidates provided through cache investment.
|
113 |
Prototyping a location aware application for UBiT. : A map-based application, designed, implemented and evaluated.Olsen, Bjarne Sletten January 2009 (has links)
Through the research performed in this thesis, it has been shown how location awareness and maps can be exploited to facilitate the use of library resources, such as information on documents and objects. A prototype has been developed to demonstrate the feasibility of integrating several different information sources for this use. The prototype created allows for users located within the city centre of Trondheim to search for documents and to locate the library departments holding them. The user is shown a map and given information on how to travel to the nearest bus stop, as well as bus schedules on how to get to the selected library department. Several information sources for the prototype has been identified and evaluated. The prototype communicates with BIBSYS for document information retrieval, Google Maps for map generation, team-trafikk.no for bus schedules querying and Amazon.com and LibraryThing.com for book cover image downloading. To ensure data consistency some local data sources are also maintained, such as a list of all the UBiT (NTNU library) departments in Trondheim. The prototype was implemented so that it would satisfy a set of requirements. These requirements were created by applying the technique of use cases. Each requirement has been discussed and prioritised based on requests from UBiT. The most important requirements have been incorporated into the design of the prototype. This focuses on modularity and it has been discussed how the external sources best can be integrated with the prototype. The prototype is implemented using a combination of programming languages. The differences between these languages have posed a challenge, and solutions to how these can be avoided are presented. The prototype has been tested according to an extensive test plan, and the results of these tests have been document and evaluated. Each of the design decisions have been evaluated and discussed, and suggestions on how these could have been improved are given. Finally, suggestions on how the functionality of the prototype can be extended are presented. The prototype created in this thesis allows for users, familiar or unfamiliar with the city and its transportation network, to locate a document and travel to the library holding it. It demonstrates how emerging technologies such as location awareness can contribute to increased use of library services.
|
114 |
Similarity Search in Large Databases using Metric Indexing and Standard Database Access MethodsOttesen, Erik Bagge January 2009 (has links)
Several methods exists for performing similarity searches quickly using metric indexing. However, most of these methods are based on main memory indexing or require specialized disk access methods. We have described and implemented a method combining standard database access methods with the LAESA Linear Approximating Eliminating Search Algorithm to perform both range and K nearest neighbour (KNN) queries using standard database access methods and relational operators. We have studied and tested various existing implementations of R-trees, and implemented the R*-tree. We also found that some of the optimizations in R*-trees was damaging to the response time at very high dimensionality. This is mostly due to the increased CPU time removing any benefit from reducing the number of disk accesses. Further we have performed comprehensive experiments using different access methods, join operators, pivot counts and range limits for both range and nearest neighbour queries. We will also implement and experiment using a multi-threaded execution environment running on several processors.
|
115 |
Algorithms and Approaches for Configuration-less Log AnalysisEllingsæter, Jenny Marie, Sandholtbråten, Frode January 2009 (has links)
System logs contain messages from a wide range of applications. They are the natural starting point when troubleshooting a system. The usual approach for analysing system logs is to write a number of regular expressionsto match specific keywords and events. When the number of expressions grows large, the analysis solution becomes unmaintainable. In addition, the use of regular expressions requires the system administrator to have extensive knowledge of the system at hand. This thesis presents methods for performing log analysis without regular expressions. This is an area of system administration that has attracted very few researchers. Therefore, little published research is available on the subject. Much effort has been put into the task of generating patterns from log file. These patterns are an important prerequisites for statistical analysis. Patterns could also be used to identify transactions for use in Markov models. None of the existing pattern mining algorithm for system logs produce satisfactory results. To solve the task at hand, a new method for mining patterns is developed. Several different approaches were tested. An approach based on inserting log lines into a tree structure turned out to be a very promising. It outputs good quality patterns and its resource use is moderate. Log analysis without prior knowledge of the system at hand have been proven difficult. This thesis shows that methods where some basic knowl- edge of systems in general is exploited, are the most promising ones. Other approaches based on Markov models and neural networks are suggested in this thesis, but they have not been tested to full extend and require some more work before being useful.
|
116 |
iAD: Query Optimization in MARSBrasetvik, Alex, Norheim, Hans Olav January 2009 (has links)
This document is the report for the authors' joint effort in researching and designing a query optimizer for Fast's next-generation search platform, known as MARS. The work was done during our master's thesis at the Department of Computer and Information Science at the Norwegian University of Science and Technology, spring 2009. MARS does not currently employ any form of query optimizer, but does have a parser and a runtime system. The report therefore focuses on the core query optimizing aspects, like plan generation and optimizer design. First, we give an introduction to query optimizers and selected problems. Then, we describe previous and ongoing efforts regarding query optimizers, before shifting focus to our own design and results. MARS supports DAG-structured query plans for more efficient execution, which means that the optimizer must do so too. This turned out to be a greater task than what it might seem like --- since we must use algorithms that greatly differ from the optimizers we were familiar with. The optimizer also needed to be extensible, including the ability to deal with future query operators, as well as supporting arbitrary cost models. During the course of the master's thesis, we have laid out the design of an optimizer that satisfies these goals. The optimizer is able recognize common sub-expressions and construct DAGs from non-DAG inputs. Extensibility is solved by loose coupling between optimizer components. Rules are used to model operators, and the cost model is a separate, customizable component. We have also implemented a prototype that demonstrates that the design actually works. The optimizer itself is designed as separate component, not tied up to MARS. We have been able to inject it into the MARS query pipeline and run queries end-to-end with optimization enabled, improving the query evaluation time. For now, the project depends on MARS assemblies, but reusing it for another engine and algebra is entirely feasible.
|
117 |
Experience transfer in professional networks in Statoil : The use of information technologyUlven, Mette January 2006 (has links)
With competition increasing in the market, companies are seeking new ways to sustain and enhance their effeciency and competitiveness. In this regard, companies have recently been focusing on knowledge as a competitive resource, and it has become a challenge for organisations to locate and share their knowledge. Many different approaches to managing knowledge exist, but the two extreme points are facial interaction and the use of information technology (IT). When people in an organisation are co-located, they can interact on a frequent basis to learn from each other. This is by citet{REF52} referred to as Communities-of-Practice (CoP), where knowledge is shared in its natural context, for example through story telling. In large organisations where people are geographically spread out however, IT is considered to be helpful for connecting people and spreading knowledge.In Statoil, professional networks are established to enable experience transfer between network members. Network members can both be co-located in the organisation or geographically spread out, and I have therefore looked at how experiences are transferred between network members. Special attention has been paid to the value of IT for connecting network members who are spread out geographically. Throughout this report I have argued that professional networks are similar to CoPs, and they are continuously being compared. To find information about professional networks, I conducted an empirical study where network members and leaders from three professional networks were interviewed.
|
118 |
Benchmarking significant DBMS costs on Niagara in order to perform a relative performance comparison between the Shared Nothing and the Shared Everything DBMS memory architecturesBjørk, Lars-Erik, Jørgensen, Truls Rinnan January 2006 (has links)
This report carries out a relative performance comparison between two DBMS architectures on the Multi Core, Single Die (MCSD) realization Niagara. The two DBMS architectures in question are Shared Nothing (SN) and Shared Everything (SE). The MCSD field is rapidly evolving, and we expect that this technology will become increasingly important in the near future. In order to carry out the comparison, the performance of the architectures must be calculated. This calculation depends on the cost figures associated with each architectural approach. To identify these costs, we present the design solutions made and results discovered in our previous work. Based on this, the most significant costs are determined and scheduled to be micro benchmarked. The natural next step is to examine possible techniques to implement the benchmarks. In order to do this, we first expand on the Niagara chip and the platform on which the micro benchmarks will run. Having a sufficient theoretical platform to continue, we move on to describe the implementation of each micro benchmark in detail. After benchmarking all the most significant costs, we thoroughly discuss the results, some of which are indeed surprising. The costs which are not benchmarked are based on assumptions from our previous work and recalculated to apply to Niagara. For both SN and SE, we evaluate the system for two classes of transactions. The first class is transactions touching one tuple (called simple), the second is transactions touching four tuples (called complex). Each class has two instances, read and update. In order to perform the subsequent analysis, the decomposition of each transaction is presented in detail. When analyzing the outcome of the calculations, interesting results emerge. First, we note that SE is the cheapest alternative when evaluating the simple transactions. This is because the SN approach includes an administrative overhead component that does not pay off when the transaction only touches one tuple. However, for complex transactions, the overhead component results in a parallel gain for SN which outperforms SE. Based on the most dominant costs of both architectures, we perform a sensitivity analysis. For SN, the analysis is based on the cost for message passing. For SE, it is based on the cost for synchronization. The goal of this analysis is two folded. First, it is interesting to see how the results vary. For example, what the ratio between the cost for message passing and the cost for synchronization must be in order to make the two approaches perform equally well. Second, the analysis indicate how error-prone each architecture is to erroneous estimation. The sensitivity analysis examine the performance of SN and SE when the ratio between the cost for message passing and the cost for synchronization is varied. This is done in both the read and the update cases. In addition to examining the simple and the complex transactions, we examine general transactions were the number of operations are not predetermined. The analysis of the general read transaction suggests that when the number of operations increases, the message passing and synchronization costs wipe out the impact of the other costs. It also suggests that when the cost of message passing is greater than 4 times the cost of synchronization, SE performs better when increasing the number of read operations. Similarly, if message passing is cheaper than 4 times the cost of synchronizing, SN is preferable. When increasing the number of update operations, the ratio is 3.33. After concluding the analysis, we suggest a hybrid architecture that might combine the advantages of SN and SE. At the cost of introducing both message passing and synchronization, the architecture introduce parallelism in SE. Lastly, we identify suggestions for future work. Realized and applied to the DBMS model introduced in this report, we believe that several of these suggestions can shrink some of the costs presented.
|
119 |
Development of a Demand Driven Dom ParserAlvestad, Gaute Odin, Gausnes, Ole Martin, Kråkenes, Ole-Jakob January 2006 (has links)
XML is a tremendous popular markup language in internet applications as well as a storage format. XML document access is often done through an API, and perhaps the most important of these is the W3C DOM. The recommendation from W3C defines a number of interfaces for a developer to access and manipulate XML documents. The recommendation does not define implementation specific approaches used behind the interfaces. A problem with the W3C DOM approach however, is that documents often are loaded in to memory as a node tree of objects, representing the structure of the XML document. This tree is memory consuming and can take up to 4-10 times the document size. Lazy processing have been proposed, building the node tree as it accesses new parts of the document. But when the whole document has been accessed, the overhead compared to traditional parsers, both in terms of memory usage and performance, is high. In this thesis a new approach is introduced. With the use of well known indexing schemes for XML, basic techniques for reducing memory consumption, and principles for memoryhandling in operation systems, a new and alternative approach is introduced. By using a memory cache repository for DOM nodes and simultaneous utilize principles for lazy processing, the proposed implementation has full control over memory consumption. The proposed prototype is called Demand Driven Dom Parser, D3P. The proposed approach removes least recently used nodes from the memory when the cache has exceeded its memory limit. This makes the D3P able to process the document with low memory requirements. An advantage with this approach is that the parser is able to process documents that exceed the size of the main memory, which is impossible with traditional approaches. The implementation is evaluated and compared with other implementations, both lazy and traditional parsers that builds everything in memory on load. The proposed implementation performs well when the bottleneck is memory usage, because the user can set the desired amount of memory to be used by the XML node tree. On the other hand, as the coverage of the document increases, time spend processing the node tree grows beyond what is used by traditional approaches.
|
120 |
Multi-Formalism Modelling of a Submarine Combat System Test Facility: an Application of DEVSSkogstad, Kjell-Inge January 2006 (has links)
This thesis aims at applying and exploring the DEVS-based theory for the purpose of gaining experiences and recommendations with regards to the usefulness of DEVS in the analysis of a submarine combat system and in establishing simulation credibility. In this regard, first a literature study of the DEVS-based literature is performed, before a case study is carried out, targeting a subset of the submarine combat system test bed under construction at Forsvarets forskningsinstitutt (FFI). In doing so, an architectural description of the subset based on DEVS is created and special requirements are discussed with regards to legacy components and technologies. Finally, the usefulness of DEVS is discussed based on experiences made during the first two tasks. Note that implementation and aspects with regards to an executable framework for the simulator is not covered. The findings of this study indicates that DEVS with its formal nature can be a valuable tool both with regards to analysis as well as in establishing credibility. Especially with regards to couplings and composability DEVS might prove helpful. The formal specifications and definitions can ensure that consistency is achieved. A formal experimental frame also ensures an unambiguous foundation for creating the simulation. With this in mind, an issue was discovered with regards to how far DEVS should go in covering aspects which can be argued to be part of the simulator within the abstract model. A solution involving the use of one or more lumped models is suggested, but need further study. Finally, the need for a proper tools and a graph notation is emphasised if DEVS is to be practical in complex simulations.
|
Page generated in 0.0949 seconds