Parallel rule induction

Inducing classification rules from large datasets is a major challenge in data mining, as ever larger volumes of data are being recorded. There are two main approaches to classification rule induction: the 'divide and conquer' approach and the 'separate and conquer' approach. Although both deliver comparable classification accuracy, they differ in how rules are represented and, in certain circumstances, in the quality of the rules. The 'divide and conquer' approach represents classification rules intuitively in the form of a tree, which is easy for humans to assimilate. However, modular rules induced by the 'separate and conquer' approach generally perform better when the classifier's training data is noisy or contains clashes. The term 'modular rules' is used to mean any set of rules describing some domain of interest; such rules will generally not fit together naturally in a decision tree. Both approaches are challenged by increasingly large volumes of data. There have been several attempts to scale up the 'divide and conquer' approach, but there is very little work on scaling up the 'separate and conquer' approach.
One general approach is to use supercomputers with faster hardware to process these huge amounts of data, yet modest-sized organisations may not be able to afford such hardware. Most organisations, however, have local computer workstations that they use for applications such as word processing or spreadsheets. These workstations are usually connected in a local network, are used mainly during normal working hours, and are usually idle overnight and at weekends. During these idle times they could be used for data mining applications on large datasets. This research focuses on a cheap solution for modest-sized organisations that cannot afford fast supercomputers, and for this reason aims to utilise the computational power and memory of a network of workstations.
In this research a novel framework for scaling up modular classification rule induction is presented, based on a distributed blackboard architecture. The framework is called PMCRI (Parallel Modular Classification Rule Inducer). It provides an underlying communication infrastructure for parallelising a whole family of modular classification rule induction algorithms: the Prism family. Experimental results show good scale-up behaviour on various datasets and thus confirm the success of PMCRI.
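
As an illustration of the 'separate and conquer' approach described above, the following is a minimal serial sketch in Python, loosely in the style of the Prism family. It is not the PMCRI framework and does not reproduce the thesis's implementation: the function names, the dictionary-based row format, the `label` key and the toy dataset are all assumptions made for this example, and a clash is handled simply by keeping the rule once no unused attribute remains.

```python
def matches(row, conditions):
    """Return True if the row satisfies every attribute = value term."""
    return all(row[attr] == value for attr, value in conditions.items())


def induce_rules_for_class(rows, target_class, attributes, label_key="label"):
    """Learn modular rules for one class by separate and conquer.

    Hypothetical helper for illustration: each rule is a dict of
    attribute -> value terms that together predict target_class.
    """
    remaining = list(rows)
    rules = []
    # Keep inducing rules while uncovered rows of the target class remain.
    while any(r[label_key] == target_class for r in remaining):
        covered = remaining
        conditions = {}
        # Conquer: specialise the rule until it covers only the target
        # class, or no unused attribute is left (a clash in the data).
        while (any(r[label_key] != target_class for r in covered)
               and len(conditions) < len(attributes)):
            best = None  # (score, attribute, value)
            for attr in attributes:
                if attr in conditions:
                    continue
                for value in sorted({r[attr] for r in covered}):
                    subset = [r for r in covered if r[attr] == value]
                    positives = sum(r[label_key] == target_class for r in subset)
                    # Highest conditional probability wins; ties are
                    # broken by the number of target-class rows covered.
                    score = (positives / len(subset), positives)
                    if best is None or score > best[0]:
                        best = (score, attr, value)
            _, attr, value = best
            conditions[attr] = value
            covered = [r for r in covered if r[attr] == value]
        rules.append(conditions)
        # Separate: remove the rows covered by the new rule.
        remaining = [r for r in remaining if not matches(r, conditions)]
    return rules


# Toy dataset, invented for this example.
rows = [
    {"outlook": "sunny", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rainy", "windy": "no", "label": "play"},
    {"outlook": "rainy", "windy": "yes", "label": "stay"},
]
play_rules = induce_rules_for_class(rows, "play", ["outlook", "windy"])
stay_rules = induce_rules_for_class(rows, "stay", ["outlook", "windy"])
print(play_rules)
print(stay_rules)
```

Each rule returned here stands on its own as a conjunction of attribute-value terms, which is why such rule sets need not fit together in a single decision tree. The inner loops that score every attribute-value pair dominate the running time on large datasets, which gives a sense of why scaling up this family of algorithms is worthwhile.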

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:508872
Date: January 2009
Creators: Stahl, Frederic Theodor
Publisher: University of Portsmouth
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
