91
Bayesian classification of DNA barcodes. Anderson, Michael P. January 1900.
Doctor of Philosophy / Department of Statistics / Suzanne Dubnicka / DNA barcodes are short strands of nucleotide bases taken from the cytochrome c oxidase
subunit 1 (COI) of the mitochondrial DNA (mtDNA). A single barcode may have the form CCGGCATAGTAGGCACTG... and typically ranges in length from 255 to around
700 nucleotide bases. Unlike nuclear DNA (nDNA), mtDNA remains largely unchanged as
it is passed from mother to offspring. It has been proposed that these barcodes may be
used as a method of differentiating between biological species (Hebert, Ratnasingham, and
deWaard 2003). While this proposal is sharply debated among some taxonomists (Will
and Rubinoff 2004), it has gained momentum and attention from biologists. One issue
at the heart of the controversy is the use of genetic distance measures as a tool for species differentiation. Current methods of species classification rely on distance measures that depend heavily both on evolutionary model assumptions and on a clearly defined "gap" between intra- and interspecies variation (Meyer and Paulay 2005). We point out the limitations of such distance measures and propose a character-based method of species classification that applies Bayes' rule to overcome these deficiencies. The proposed method is shown to provide accurate species-level classification, and it answers important questions that are not addressable with current distance-based methods.
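As a rough illustration of what a character-based application of Bayes' rule to barcodes can look like, the Python sketch below trains a position-wise naive Bayes classifier on a few invented barcode fragments. The training sequences, species names, add-one smoothing and the assumption of independence between sites are all illustrative choices, not details taken from the thesis.

```python
from collections import Counter, defaultdict
import math

# Hypothetical aligned barcode fragments labelled by species (invented data).
training = [
    ("CCGGCATAGTAGGCACTG", "species_A"),
    ("CCGGCATAGTTGGCACTG", "species_A"),
    ("CCGGTATAGTAGGCGCTG", "species_B"),
    ("CCGGTATAGTAGGCGCTA", "species_B"),
]

BASES = "ACGT"

def train(sequences):
    """Estimate P(base at site i | species) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(Counter))
    priors = Counter()
    for seq, species in sequences:
        priors[species] += 1
        for i, base in enumerate(seq):
            counts[species][i][base] += 1
    return counts, priors

def classify(seq, counts, priors):
    """Apply Bayes' rule site by site, assuming sites are independent."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for species in priors:
        score = math.log(priors[species] / total)
        for i, base in enumerate(seq):
            site = counts[species][i]
            score += math.log((site[base] + 1) / (sum(site.values()) + len(BASES)))
        if score > best_score:
            best, best_score = species, score
    return best

counts, priors = train(training)
print(classify("CCGGCATAGTAGGCACTG", counts, priors))  # -> species_A
```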
92
MapReduce network enabled algorithms for classification based on association rules. Hammoud, Suhel. January 2011.
There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets, which is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The association rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach that combines miners using counting methods on horizontal datasets with miners using set intersections on vertical-format datasets. The new miner generates the same rules as Apriori-like algorithms because it uses the same definitions of the confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR and MMAC. This thesis also introduces a new MapReduce classifier based on the MapReduce association rule miner. The algorithm employs different approaches for rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. The results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with traditional and associative classification approaches. The MapReduce simulator makes it possible to measure the scalability of MapReduce-based applications easily and quickly and to capture the behaviour of algorithms on cluster environments; this also allows the configuration of MapReduce clusters to be optimized for better execution times and hardware utilization.
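The support-counting core of such a miner follows the classic word-count shape that MapReduce scales out. The sketch below shows that shape only: the map phase emits candidate itemsets per transaction and the reduce phase sums their counts against a support threshold. The in-memory driver stands in for a real Hadoop cluster, and the transactions and threshold are invented.

```python
from collections import defaultdict
from itertools import combinations

# Invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def map_phase(transaction, k):
    """Emit (candidate k-itemset, 1) pairs, as a Hadoop mapper would."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reduce_phase(pairs, min_support):
    """Sum the counts per itemset and keep the frequent ones."""
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

# In-memory driver standing in for the MapReduce framework itself.
pairs = [p for t in transactions for p in map_phase(t, 2)]
print(reduce_phase(pairs, min_support=2))
```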
93
A dynamic logistic model for combining classifier outputs. Tomas, Amber Nede. January 2008.
Many classification algorithms are designed on the assumption that the population of interest is stationary, i.e. it does not change over time. However, there are many real-world problems where this assumption is not appropriate. In this thesis, we develop a classifier for non-stationary populations which is based on a multiple logistic model for the conditional class probabilities and incorporates a linear combination of the outputs of a number of pre-determined component classifiers. The final classifier is able to adjust to changes in the population by sequential updating of the coefficients of the linear combination, which are the parameters of the model. The model we use is motivated by the relatively good classification performance which has been achieved by classification rules based on combining classifier outputs. However, in some cases such classifiers can also perform relatively poorly, and in general the mechanisms behind such results are little understood. For the model we propose, which is a generalisation of several existing models for stationary classification problems, we show there exists a simple relationship between the component classifiers which are used, the sign of the parameters and the decision boundaries of the final classifier. This relationship can be used to guide the choice of component classifiers, and helps with understanding the conditions necessary for the classifier to perform well. We compare several "on-line" algorithms for implementing the classification model, where the classifier is updated as new labelled observations become available. The predictive approach to classification is adopted, so each algorithm is based on updating the posterior distribution of the parameters as new information is received. Specifically, we compare a method which assumes the posterior distribution is Gaussian, a more general method developed for the class of Dynamic Generalised Linear Models, and a method based on a sequential Monte Carlo approximation of the posterior. The relationship between the model used for parameter evolution, the bias of the parameter estimates and the error of the classifier is explored.
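The general shape of such a model can be sketched as a logistic regression over the outputs of fixed component classifiers, with the combination weights updated sequentially as labelled observations arrive. In the sketch below, a plain stochastic gradient step stands in for the Gaussian, DGLM and sequential Monte Carlo posterior updates the thesis compares, and the component classifiers and drifting data stream are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented component classifiers: each maps a feature vector to a score in (0, 1).
component_classifiers = [
    lambda x: 1.0 / (1.0 + np.exp(-x[0])),  # relies on the first feature
    lambda x: 1.0 / (1.0 + np.exp(-x[1])),  # relies on the second feature
]

beta = np.zeros(len(component_classifiers) + 1)  # intercept + combination weights

def predict_proba(x):
    """Logistic model on a linear combination of the component outputs."""
    z = np.array([1.0] + [c(x) for c in component_classifiers])
    return 1.0 / (1.0 + np.exp(-beta @ z)), z

def update(x, y, learning_rate=0.5):
    """Sequentially adjust the weights as each labelled observation arrives
    (a gradient step standing in for the posterior updates in the thesis)."""
    global beta
    p, z = predict_proba(x)
    beta += learning_rate * (y - p) * z

# Simulated non-stationary stream: the informative feature switches halfway.
for t in range(2000):
    x = rng.normal(size=2)
    y = int(x[0] > 0) if t < 1000 else int(x[1] > 0)
    update(x, y)
print(beta)  # the weights have adapted toward the second component classifier
```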
94
Evolutionary reinforcement learning of spoken dialogue strategies. Toney, Dave. January 2007.
From a system developer's perspective, designing a spoken dialogue system can be a time-consuming and difficult process. A developer may spend a lot of time anticipating how a potential user might interact with the system and then deciding on the most appropriate system response. These decisions are encoded in a dialogue strategy, essentially a mapping between anticipated user inputs and appropriate system outputs. To reduce the time and effort associated with developing a dialogue strategy, recent work has concentrated on modelling the development of a dialogue strategy as a sequential decision problem. Using this model, reinforcement learning algorithms have been employed to generate dialogue strategies automatically. These algorithms learn strategies by interacting with simulated users. Some progress has been made with this method but a number of important challenges remain. For instance, relatively little success has been achieved with the large state representations that are typical of real-life systems. Another crucial issue is the time and effort associated with the creation of simulated users. In this thesis, I propose an alternative to existing reinforcement learning methods of dialogue strategy development. More specifically, I explore how XCS, an evolutionary reinforcement learning algorithm, can be used to find dialogue strategies that cover large state spaces. Furthermore, I suggest that hand-coded simulated users are sufficient for the learning of useful dialogue strategies. I argue that the use of evolutionary reinforcement learning and hand-coded simulated users is an effective approach to the rapid development of spoken dialogue strategies. Finally, I substantiate this claim by evaluating a learned strategy with real users. Both the learned strategy and a state-of-the-art hand-coded strategy were integrated into an end-to-end spoken dialogue system. The dialogue system allowed real users to make flight enquiries using a live database for an Edinburgh-based airline. The performance of the learned and hand-coded strategies was then compared. The evaluation results show that the learned strategy performs as well as the hand-coded one (81% and 77% task completion respectively) but takes much less time to design (two days instead of two weeks). Moreover, the learned strategy compares favourably with previous user evaluations of learned strategies.
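To make the idea of a hand-coded simulated user concrete, the sketch below pairs a small rule-based user with a fixed slot-filling strategy for a flight-enquiry task. The slot names, the over-answering probability and the strategy table are all invented; in the thesis the strategy itself would be learned by XCS from many such simulated dialogues rather than written by hand.

```python
import random

SLOTS = ["origin", "destination", "date"]

class SimulatedUser:
    """A minimal hand-coded simulated user; goal and behaviour are invented."""

    def __init__(self):
        self.goal = {"origin": "Edinburgh", "destination": "London", "date": "Friday"}

    def respond(self, system_action):
        """Answer a system prompt; occasionally over-answer, as real users do."""
        if system_action.startswith("ask_"):
            slot = system_action[len("ask_"):]
            reply = {slot: self.goal[slot]}
            if random.random() < 0.3:  # volunteer an extra slot 30% of the time
                extra = random.choice([s for s in SLOTS if s != slot])
                reply[extra] = self.goal[extra]
            return reply
        return {}

def strategy(filled):
    """A dialogue strategy maps filled-slot states to actions; a learner such
    as XCS would evolve this mapping from many simulated dialogues."""
    for slot in SLOTS:
        if slot not in filled:
            return "ask_" + slot
    return "query_database"

user, filled = SimulatedUser(), {}
while (action := strategy(filled)) != "query_database":
    filled.update(user.respond(action))
print("dialogue complete:", filled)
```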
95
Sistemas classificadores evolutivos para problemas multirrótulo / Learning classifier system for multi-label classification. Vallim, Rosane Maria Maffei. 27 July 2009.
Classification is probably the most studied task in Machine Learning, with applications to a broad range of real problems such as text categorization, medical diagnosis and bioinformatics, as well as commercial and industrial applications. In general, classification problems can be categorized by the number of class labels that may be associated with each input instance. The approach most studied by the Machine Learning community considers mutually exclusive classes. However, there is a large variety of important problems in which each instance can be associated with more than one class label; these are called multi-label classification problems. Learning Classifier Systems (LCS) are a rule-induction technique that uses a Genetic Algorithm as its primary search mechanism. The technique searches for sets of rules that have high classification accuracy and that are also understandable and interesting from a classification point of view. Although there are several works on LCS for classification problems with mutually exclusive classes, there is no record of an LCS that can deal with the multi-label classification problem. The objective of this work is to propose an LCS for multi-label classification that builds a set of classification rules achieving results that are effective and comparable to other multi-label methods. In line with this objective, the work also reviews the themes involved in the proposal: Learning Classifier Systems and multi-label classification.
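One way to picture the proposal is as an LCS rule population in which each rule carries a set of labels rather than a single class, with prediction by accuracy-weighted voting. The sketch below shows that representation only; the rules, voting scheme and threshold are invented, and the genetic search that would evolve the population is omitted.

```python
# Invented rule population: each rule pairs a ternary condition ('#' = wildcard)
# with a set of labels and an accuracy estimate, instead of a single class.
rules = [
    ("1#0#", {"sports", "politics"}, 0.90),
    ("##01", {"economy"},            0.75),
    ("1#1#", {"sports"},             0.60),
]

def matches(condition, instance):
    """A rule matches when every non-wildcard position agrees."""
    return all(c in ("#", b) for c, b in zip(condition, instance))

def classify(instance, threshold=0.5):
    """Accuracy-weighted vote of all matching rules, one vote per label."""
    votes, total = {}, 0.0
    for condition, labels, accuracy in rules:
        if matches(condition, instance):
            total += accuracy
            for label in labels:
                votes[label] = votes.get(label, 0.0) + accuracy
    return {l for l, v in votes.items() if total and v / total >= threshold}

print(classify("1001"))  # matches the first two rules -> {'sports', 'politics'}
```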
96
Classificador de kernels para mapeamento em plataforma de computação híbrida composta por FPGA e GPP / Classifier of kernels for mapping onto a hybrid computing platform composed of FPGA and GPP. Sumoyama, Alexandre Shigueru. 17 May 2016.
The steady increase in demand for ever more efficient computer systems has motivated the search for customized hybrid systems composed of GPPs (General Purpose Processors), FPGAs (Field-Programmable Gate Arrays) and GPUs (Graphics Processing Units). Used together, these devices make it possible to optimize the trade-off between performance and energy consumption. Such systems depend on techniques that choose the most appropriate mapping based on the profile of the source code. This project therefore proposes a technique for mapping kernels between GPP and FPGA. It builds on a data mining approach that evaluates the similarity between source codes. The technique developed here achieved hit rates of 65.67% for code synthesized to FPGA with the LegUP tool and 59.19% with Impulse C, while for the GPP the code was compiled with GCC (GNU Compiler Collection) using its OpenMP support. The results show that this approach can be employed as an initial decision point in the mapping process for hybrid systems, analyzing only the profile of the source code, without the need to execute it before making the decision.
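A rough sketch of the decision-point idea, under invented assumptions, appears below: extract crude static features from a kernel's source code and map the kernel to the device whose previously labelled kernels it most resembles. The feature set, similarity measure and labelled examples are illustrative stand-ins for the data mining approach actually used in the project.

```python
import re
from collections import Counter

def profile(source):
    """Crude static feature vector: loops, array accesses, multiplies, calls."""
    return Counter(
        loops=len(re.findall(r"\b(?:for|while)\b", source)),
        arrays=len(re.findall(r"\w+\s*\[", source)),
        mults=source.count("*"),
        calls=len(re.findall(r"\b(?!for\b|while\b|if\b)\w+\s*\(", source)),
    )

def similarity(a, b):
    """Jaccard-style overlap between two feature multisets."""
    keys = set(a) | set(b)
    return sum(min(a[k], b[k]) for k in keys) / (sum(max(a[k], b[k]) for k in keys) or 1)

# Invented kernels already known (hypothetically) to run best on each device.
labelled = [
    (profile("for(i=0;i<n;i++) c[i]=a[i]*b[i];"), "FPGA"),
    (profile("if(x>0) y = log(x) + sqrt(x);"), "GPP"),
]

def map_kernel(source):
    """Map a new kernel to the device of its most similar labelled kernel."""
    p = profile(source)
    return max(labelled, key=lambda lv: similarity(p, lv[0]))[1]

print(map_kernel("for(j=0;j<m;j++) out[j]=in[j]*k[j];"))  # -> FPGA
```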
97
Social training: aprendizado semi supervisionado utilizando funções de escolha social / Social-Training: Semi-Supervised Learning Using Social Choice Functions. Alves, Matheus. January 2017.
Given the huge quantity of data currently being generated, only a small portion of it can be manually labeled by human experts. This is a common challenge for machine learning applications. Semi-supervised learning addresses the problem by exploiting unlabeled data alongside the labeled data. However, if only a limited number of labeled examples is available, the performance of the machine learning task (e.g., classification) can be unsatisfactory. Many solutions address this issue with an ensemble of classifiers, since this approach increases classifier diversity. Algorithms such as co-training and tri-training use multiple data partitions or multiple learning algorithms to improve the classification of unlabeled instances through simple majority agreement. Other approaches extend this idea and adopt less trivial voting processes to define the labels, such as weighted majority voting. These solutions, however, require the labels to reach a certain confidence level before they are used for training. Consequently, not all of the available information is used: information associated with low confidence levels is ignored completely. This work proposes an approach called social-training, which uses all of the information available in the semi-supervised learning task. Multiple heterogeneous classifiers are trained on the labeled data and generate diverse classifications for the same unlabeled instances. Social-training then aggregates these results into a single label by means of social choice functions that perform rank aggregation over the instances. The solution addresses binary classification. The results show that working with the full ranking, i.e., labeling all unlabeled instances, is able to reduce the classification error on several of the UCI data sets used.
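The aggregation step can be illustrated with a Borda count, a classic social choice function; treat it as an example of the family of functions involved rather than the specific ones evaluated in the work. In the sketch below, each classifier ranks the unlabeled instances by invented confidence scores, the Borda points are summed, and every instance receives a label, so no low-confidence information is discarded.

```python
unlabeled = ["x1", "x2", "x3", "x4"]

# Invented P(positive) scores from three heterogeneous trained classifiers.
scores = {
    "clf_a": {"x1": 0.9, "x2": 0.4, "x3": 0.7, "x4": 0.1},
    "clf_b": {"x1": 0.6, "x2": 0.5, "x3": 0.8, "x4": 0.3},
    "clf_c": {"x1": 0.8, "x2": 0.2, "x3": 0.9, "x4": 0.4},
}

def borda(scores, instances):
    """Each classifier's ranking awards n-1, n-2, ... points; sum across voters."""
    points = {x: 0 for x in instances}
    n = len(instances)
    for clf in scores.values():
        ranking = sorted(instances, key=lambda x: clf[x], reverse=True)
        for position, x in enumerate(ranking):
            points[x] += n - 1 - position
    return sorted(instances, key=lambda x: points[x], reverse=True)

ranking = borda(scores, unlabeled)
half = len(ranking) // 2
labels = {x: int(x in ranking[:half]) for x in ranking}
print(ranking, labels)  # every instance is labeled; nothing is discarded
```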
98
Automated retrieval and extraction of training course information from unstructured web pages. Xhemali, Daniela. January 2010.
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages, and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. Some commercial, automated web extraction software packages exist; however, their success depends on heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages, and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much deliberation, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) to generate web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well: the Web Crawler outperforms existing crawling systems, the Web Classifier achieves an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieves an accuracy of over 94% for the extraction of course titles and just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
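The extraction step can be pictured as regular expressions applied to page text to pull out course attributes. In the sketch below, the patterns are hand-written stand-ins for the kind of Regular Expressions GP would evolve, and the sample page text is invented.

```python
import re

# Invented sample page text.
page_text = """
Course: Introduction to Project Management
Price: £499.00   Date: 12 March 2010   Location: Loughborough
"""

# Hand-written stand-ins for the kind of expressions GP would evolve.
patterns = {
    "price": r"[£$€]\s?\d+(?:\.\d{2})?",
    "date": (r"\d{1,2}\s+(?:January|February|March|April|May|June|July|"
             r"August|September|October|November|December)\s+\d{4}"),
    "location": r"Location:\s*([A-Z][a-z]+)",
}

for attribute, pattern in patterns.items():
    match = re.search(pattern, page_text)
    print(attribute, "->", match.group(match.lastindex or 0) if match else None)
```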
99
Analysis of Current Flows in Electrical Networks for Error-Tolerant Graph Matching. Gutierrez Munoz, Alejandro. 10 November 2008.
Chemical compounds, fingerprint databases, social networks, and interactions between websites all have one thing in common: the information they contain can be represented as graphs. The need to analyze, compare, and classify graph datasets has become more evident over the last decade. The graph isomorphism problem is known to belong to the class NP, and the subgraph isomorphism problem is known to be NP-complete. Several error-tolerant graph matching techniques have been developed during the last two decades in order to overcome the computational complexity associated with these problems. Some of these techniques rely upon similarity measures based on the topology of the graphs; random walks and edit distance kernels are examples of such methods. In conjunction with learning algorithms like back-propagation neural networks, k-nearest neighbor, and support vector machines (SVM), these methods provide a way of classifying graphs based on a training set of labeled instances.
This thesis presents a novel approach to error-tolerant graph matching based on current flow analysis. Analysis of current flow in electrical networks is a technique that uses the voltages and currents obtained through nodal analysis of a graph representing an electrical circuit. Current flow analysis in electrical networks shares some interesting connections with the number of random walks along the graph.
We propose an algorithm to calculate a similarity measure between two graphs based on the current flows along geodesics of the same degree. This similarity measure can be applied over large graph datasets, allowing these datasets to be compared in a reasonable amount of time. This thesis investigates the classification potential of several data mining algorithms based on the information extracted from a graph dataset and represented as current flow vectors. We describe our operational prototype and evaluate its effectiveness on the NCI-HIV dataset.
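The nodal-analysis computation behind such current flows can be sketched directly: treat the graph as a unit-resistance electrical network, inject a unit of current at one node and extract it at another, solve the (singular) Laplacian system for the node voltages, and read the edge currents off the voltage differences. The small example graph below is illustrative, and the geodesic-based similarity measure built on top of these currents is not reproduced.

```python
import numpy as np

# Illustrative graph treated as a resistor network with unit resistance per edge.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n = 4

# Graph Laplacian: degrees on the diagonal, -1 for each edge.
L = np.zeros((n, n))
for u, v in edges:
    L[u, u] += 1; L[v, v] += 1
    L[u, v] -= 1; L[v, u] -= 1

def current_flow(s, t):
    """Inject one unit of current at s, extract it at t, and solve for the
    node voltages; with unit resistances the edge current is the voltage drop."""
    b = np.zeros(n)
    b[s], b[t] = 1.0, -1.0
    voltages = np.linalg.pinv(L) @ b  # pseudo-inverse because L is singular
    return {(u, v): voltages[u] - voltages[v] for u, v in edges}

for edge, current in current_flow(0, 3).items():
    print(edge, round(current, 3))
```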
100
Classification techniques for hyperspectral remote sensing image data. Jia, Xiuping. Electrical Engineering, Australian Defence Force Academy, UNSW. January 1996.
Hyperspectral remote sensing image data, such as that recorded by AVIRIS with 224 spectral bands, provides rich information on ground cover types. However, it presents new problems in machine-assisted interpretation, mainly in long processing times and the difficulty of class training due to the low ratio of the number of training samples to the number of bands. This thesis investigates feasible and efficient feature reduction and image classification techniques appropriate for hyperspectral image data. The study is reported in three parts. The first concerns a deterministic approach to hyperspectral data interpretation. Multigroup and multiple-threshold spectral coding procedures, and associated techniques for spectral matching and classification, are proposed and tested. By coding on subgroups of bands using one or three thresholds, spectral searching and matching becomes simple, fast and free of the need for radiometric correction. Modifications of existing statistical techniques are proposed in the second part of the investigation. A block-based maximum likelihood classification technique is developed. Several subgroups are formed from the complete set of spectral bands in the data, based on the properties of global correlation among the bands. Subgroups which are poorly correlated with each other are treated independently using conventional maximum likelihood classification. Experimental results demonstrate that, when using appropriate subgroup sizes, the new method provides a compromise among classification accuracy, processing time and available training pixels. Furthermore, a segmented, and possibly multi-layer, principal components transformation is proposed as a feature reduction technique prior to classification and for effective colour display. The transformation is performed efficiently on each of the highly correlated subgroups of bands independently. Selected features from each transformed subgroup can then be transformed again to achieve a satisfactory data reduction ratio and to generate the three most significant components for colour display. Classification accuracy is improved and high-quality colour image display is achieved in experiments using two AVIRIS data sets.
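A compact sketch of the segmented principal components transformation, on synthetic data, is given below: the bands are partitioned into highly correlated subgroups, each subgroup is transformed independently, and only the leading components are kept. The subgroup boundaries, group count and synthetic pixels are invented; in the thesis the subgroups derive from the global band correlations of real AVIRIS imagery.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for hyperspectral pixels: 1000 pixels x 12 bands, where
# the bands inside each of three subgroups are highly correlated.
base = rng.normal(size=(1000, 3))
pixels = np.hstack([base[:, [g]] + 0.1 * rng.normal(size=(1000, 4))
                    for g in range(3)])

# In practice the subgroup boundaries come from the global band correlations.
subgroups = [range(0, 4), range(4, 8), range(8, 12)]

def segmented_pct(X, subgroups, keep=1):
    """Principal components transform on each band subgroup independently,
    keeping only the leading components of each."""
    features = []
    for bands in subgroups:
        block = X[:, bands] - X[:, bands].mean(axis=0)
        _, _, Vt = np.linalg.svd(block, full_matrices=False)
        features.append(block @ Vt[:keep].T)
    return np.hstack(features)

reduced = segmented_pct(pixels, subgroups, keep=1)
print(pixels.shape, "->", reduced.shape)  # (1000, 12) -> (1000, 3)
```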