21

Multiple Cooperative Swarms for Data Clustering

Ahmadi, Abbas January 2008 (has links)
Exploring a set of unlabeled data to extract similar clusters, known as data clustering, is an appealing problem in machine learning. In other words, data clustering organizes the underlying data into different groups using a notion of similarity between patterns. A new approach to the data clustering problem based on multiple cooperative swarms is introduced, inspired by the social swarming behavior of biological bird flocks that search for food situated in several places. The approach is composed of two main phases: initialization and exploitation. In the initialization phase, the aim is to distribute the search space among several swarms; that is, a part of the search space is assigned to each swarm. In the exploitation phase, each swarm searches for the center of its associated cluster while cooperating with the other swarms, and the search converges to a near-optimal solution. Compared to the single-swarm clustering approach, the proposed multiple cooperative swarms provide better solutions in terms of the fitness measure of the cluster centers as the dimensionality of the data and the number of clusters increase. The multiple cooperative swarms clustering approach assumes that the number of clusters is known a priori; the notion of stability analysis is therefore proposed to extract the number of clusters for the underlying data using multiple cooperative swarms. Mathematical explanations of why the proposed approach leads to more stable and robust results than single-swarm clustering are also provided. The proposed clustering is then applied to one of the most challenging problems in speech recognition: phoneme recognition. The approach decomposes the recognition task into a number of subtasks or modules, each involving a set of similar phonemes known as a phoneme family; the goal is to obtain the best grouping into phoneme families using the proposed multiple cooperative swarms clustering. Experiments on the standard TIMIT corpus indicate that the proposed clustering approach considerably boosts the accuracy of the modular approach to phoneme recognition.
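To make the two-phase scheme concrete, here is a minimal sketch in which one swarm per cluster runs particle swarm optimization over its own center, with a fitness that couples each swarm to the other swarms' current best centers. The seeding scheme, the PSO constants, and the name `multi_swarm_clustering` are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def multi_swarm_clustering(X, k, n_particles=20, iters=100,
                           w=0.72, c1=1.49, c2=1.49, seed=0):
    """Toy multiple-cooperative-swarms clustering: one swarm per center."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization phase: carve up the search space by seeding each
    # swarm around a distinct data point (illustrative, not the thesis' scheme).
    seeds = X[rng.choice(n, k, replace=False)]
    pos = seeds[:, None, :] + 0.1 * rng.standard_normal((k, n_particles, d))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    gbest = seeds.copy()

    def fitness(centers):
        # Total distance from each point to its nearest of the k centers.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return dists.min(axis=1).sum()

    def swarm_fit(i, candidate):
        # Cooperation: swarm i is scored jointly with the other swarms' bests.
        centers = gbest.copy()
        centers[i] = candidate
        return fitness(centers)

    pbest_fit = np.array([[swarm_fit(i, pos[i, j]) for j in range(n_particles)]
                          for i in range(k)])
    # Exploitation phase: standard PSO updates, one swarm per cluster.
    for _ in range(iters):
        for i in range(k):
            r1, r2 = rng.random((2, n_particles, d))
            vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest[i] - pos[i]))
            pos[i] += vel[i]
            for j in range(n_particles):
                f = swarm_fit(i, pos[i, j])
                if f < pbest_fit[i, j]:
                    pbest_fit[i, j], pbest[i, j] = f, pos[i, j].copy()
            gbest[i] = pbest[i, pbest_fit[i].argmin()].copy()
    return gbest
```

A call such as `multi_swarm_clustering(X, k=3)` returns one center per cooperating swarm; a single-swarm baseline would instead encode all k centers inside each particle of one swarm.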
22

Voting-Based Consensus of Data Partitions

Ayad, Hanan 08 1900 (has links)
Over the past few years, there has been a renewed interest in the consensus problem for ensembles of partitions. Recent work is primarily motivated by developments in the area of combining multiple supervised learners. Unlike the consensus of supervised classifications, the consensus of data partitions is a challenging problem due to the lack of globally defined cluster labels and to the inherent difficulty of data clustering as an unsupervised learning problem. Moreover, the true number of clusters may be unknown. A fundamental goal of consensus methods for partitions is to obtain an optimal summary of an ensemble and to discover a cluster structure with accuracy and robustness exceeding those of the individual ensemble partitions. The quality of the consensus partitions depends heavily on the ensemble generation mechanism and on the suitability of the consensus method for combining the generated ensemble. Typically, consensus methods derive an ensemble representation that is used as the basis for extracting the consensus partition. Most ensemble representations circumvent the labeling problem. Voting-based methods, on the other hand, establish direct parallels with consensus methods for supervised classifications by seeking an optimal relabeling of the ensemble partitions and deriving an ensemble representation consisting of a central aggregated partition. An important element of the voting-based aggregation problem is the pairwise relabeling of an ensemble partition with respect to a representative partition of the ensemble, which is referred to here as the voting problem. The voting problem is commonly formulated as a weighted bipartite matching problem. In this dissertation, a general theoretical framework for the voting problem as a multi-response regression problem is proposed. The problem is formulated as estimating the uncertainties associated with the assignments of the objects to the representative clusters, given their assignments to the clusters of an ensemble partition. A new voting scheme, referred to as cumulative voting, is derived as a special instance of the proposed regression formulation, corresponding to fitting a linear model by least squares estimation. The proposed formulation reveals the close relationships between the underlying loss functions of the cumulative voting and bipartite matching schemes. A useful feature of the proposed framework is that it can model substantial variability between partitions, such as a variable number of clusters. A general aggregation algorithm with variants corresponding to cumulative voting and bipartite matching is applied, and a simulation-based analysis is presented to compare the suitability of each scheme to different ensemble generation mechanisms. Bipartite matching is found to be more suitable than cumulative voting for a particular generation model, whereby each ensemble partition is generated as a noisy permutation of an underlying labeling according to a probability of error. For ensembles with a variable number of clusters, it is proposed that the aggregated partition be viewed as an estimated distributional representation of the ensemble, on the basis of which a criterion may be defined to seek an optimally compressed consensus partition. The properties and features of the proposed cumulative voting scheme are studied. In particular, the relationship between cumulative voting and the well-known co-association matrix is highlighted.
Furthermore, an adaptive aggregation algorithm suited to the cumulative voting scheme is proposed. The algorithm selects the initial reference partition and the aggregation sequence of the ensemble partitions such that the loss of mutual information associated with the aggregated partition is minimized. In order to subsequently extract the final consensus partition, an efficient agglomerative algorithm is developed. The algorithm merges the aggregated clusters such that the maximum amount of information is preserved, and it allows the optimal number of consensus clusters to be estimated. An empirical study using several artificial and real-world datasets demonstrates that the proposed cumulative voting scheme discovers substantially more accurate consensus partitions than bipartite matching in the case of ensembles with a relatively large or a variable number of clusters. Compared to other recent consensus methods, the proposed method is found to be comparable with or better than the best-performing methods. Moreover, accurate estimates of the true number of clusters are often achieved using cumulative voting, whereas consistently poor estimates are obtained with bipartite matching. The empirical evidence demonstrates that the bipartite matching scheme is not suitable for these types of ensembles.
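To make the two relabeling schemes concrete, the sketch below implements cumulative voting as the least-squares fit of a linear model (matching the regression formulation above) alongside the weighted bipartite matching baseline. It is a toy reconstruction under stated assumptions, not the dissertation's algorithms; helper names such as `cumulative_vote` are invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def indicator(labels, k):
    """n x k one-hot (hard indicator) matrix for a label vector."""
    H = np.zeros((labels.size, k))
    H[np.arange(labels.size), labels] = 1.0
    return H

def cumulative_vote(u_labels, v_labels, k_ref):
    """Relabel partition U against reference V by least squares.

    Each cluster of U casts fractional votes for the reference clusters,
    yielding a soft n x k_ref representation of U in V's label space.
    """
    U = indicator(u_labels, u_labels.max() + 1)
    V = indicator(v_labels, k_ref)
    W, *_ = np.linalg.lstsq(U, V, rcond=None)  # argmin ||U W - V||_F^2
    return U @ W

def bipartite_match(u_labels, v_labels, k_ref):
    """Relabel U against V by maximum-weight bipartite matching.

    Assumes U has at most k_ref clusters, so every cluster gets matched.
    """
    overlap = indicator(u_labels, u_labels.max() + 1).T @ indicator(v_labels, k_ref)
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    mapping = dict(zip(rows, cols))
    return np.array([mapping[c] for c in u_labels])

def consensus(ensemble, k_ref):
    """Accumulate votes against a fixed reference, then harden by argmax.

    (The dissertation selects the reference and the aggregation sequence
    adaptively; the fixed choice here is a simplification.)
    """
    ref = ensemble[0]
    votes = sum(cumulative_vote(u, ref, k_ref) for u in ensemble)
    return votes.argmax(axis=1)
```

With a hard indicator `U`, the least-squares solution reduces to fractional voting: `W[q, p]` is the fraction of cluster q's objects that fall in reference cluster p.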
23

Automatic Essential Content Extraction from Asynchronous Discussion Boards in e-Learning

Lu, Ping-Hui 03 July 2004 (has links)
With the spread of the Internet and multimedia, e-Learning has become an important mode of learning. e-Learning is easy to use and to put into practice, but it also has drawbacks. One of them is the difficulty of reusing the important and valuable knowledge accumulated in asynchronous discussion boards: apart from enthusiastic teachers or assistants, nobody has the time or the inclination to curate the important discussions held there. Asynchronous discussion boards are nevertheless an important e-Learning tool through which teachers and students communicate and discuss class material, and reusing the class knowledge captured in those discussions can help both teach and study more efficiently and effectively. To date, however, there has been little research in this area. In this research we therefore develop a method for automatically extracting essential content from asynchronous discussion boards in e-Learning. We first review the usage, management, and shortcomings of asynchronous discussion boards in e-Learning, then describe the design process of the method in detail, and finally present the operation of the resulting system and its extraction results. The overall aim is to help teachers and students reuse the valuable knowledge in past class discussions easily and quickly.
24

Mining IT Product Life Cycle from Massive Newsgroup Articles

Chou, Cheng-Chi 22 July 2003 (has links)
The product life cycle (PLC) can serve as a managerial tool: marketing strategies must change as a product moves through its life cycle, and managers who understand the cycle concept are better positioned to forecast future sales activity and plan marketing strategies. In practice, however, the PLC is often misjudged because the necessary data are hard to access and decision-making information is lacking. This thesis therefore applies a customer-behavior model to analyze the relationship between the frequency and the duration of product discussion in newsgroups, and computes a PLC pattern that reflects the product's current position in customers' minds. The PLC curve is then constructed from this analysis. Moreover, data mining and information retrieval techniques are employed to diagnose the variance of discussion frequency and the content of discussion articles, so as to extract the distinctive events that influenced the PLC curve. The main contributions of this thesis are the construction of the PLC curve from massive newsgroup discussion and the extraction of the distinctive events that shaped it.
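The frequency-based construction admits a very small sketch: count product mentions per month, smooth the counts into a PLC-like curve, and flag months that deviate strongly from the curve as candidate distinctive events. The resampling granularity, the rolling-mean smoother, and the deviation threshold are illustrative assumptions, not the thesis' customer-behavior model.

```python
import numpy as np
import pandas as pd

def plc_curve(post_times, freq="M", window=3, z_thresh=2.0):
    """Sketch: PLC curve and candidate events from newsgroup post timestamps."""
    # Monthly mention counts stand in for market interest in the product.
    counts = pd.Series(1, index=pd.to_datetime(post_times)).resample(freq).sum()
    # A rolling mean smooths the counts into a life-cycle-like curve.
    curve = counts.rolling(window, center=True, min_periods=1).mean()
    # Months far from the smoothed curve hint at distinctive events.
    resid = counts - curve
    events = counts[np.abs(resid) > z_thresh * resid.std()]
    return curve, events
```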
25

Daugiamačio pasiskirstymo tankio neparametrinis įvertinimas naudojant stebėjimų klasterizavimą / The nonparametric estimation of multivariate distribution density applying clustering procedures

Ruzgas, Tomas 14 March 2007 (has links)
The paper is devoted to statistical nonparametric estimation of multivariate distribution density. The influence of data pre-clustering on the estimation accuracy of multimodal density is analysed by means of the Monte-Carlo method.
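One simple way to realize the pre-clustering idea is sketched below: split the sample into clusters, fit a kernel density estimator per cluster, and mix the estimates with weights proportional to cluster sizes. The tools used (scikit-learn k-means, a Gaussian KDE) are assumptions for illustration, not the estimator configurations actually compared in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import gaussian_kde

def clustered_kde(X, n_clusters=3, seed=0):
    """Mixture-of-KDEs density estimate after pre-clustering the sample.

    Assumes each cluster retains more points than the data dimension,
    since gaussian_kde needs a non-singular sample covariance.
    """
    labels = KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(X)
    parts = [(np.mean(labels == c), gaussian_kde(X[labels == c].T))
             for c in range(n_clusters)]

    def density(points):
        # points: (m, d) array; returns the mixture density at each point.
        return sum(w * kde(points.T) for w, kde in parts)
    return density
```

The accuracy of such an estimate against a known multimodal density is exactly what a Monte-Carlo study of the kind described above measures, by averaging an error criterion over repeated simulated samples.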
26

Técnicas de agrupamento de dados para computação aproximativa / Data clustering techniques for approximate computing

Malfatti, Guilherme Meneguzzi January 2017 (has links)
Two of the major drivers of increased performance in single-thread applications - higher operating frequency and exploitation of instruction-level parallelism - have advanced little in recent years due to power constraints. In this context, considering the intrinsic imprecision tolerance (i.e., outputs may present an acceptable level of noise without compromising the result) of many modern applications, such as image processing and machine learning, approximate computing becomes a promising approach. The technique is based on computing approximate instead of accurate results, which can increase performance and reduce energy consumption at the cost of quality. In the current state of the art, the most common way of exploiting the technique is through neural networks (more specifically, the Multilayer Perceptron model), due to the ability of these structures to learn arbitrary functions and to approximate them. Such networks are usually implemented in a dedicated neural accelerator. However, this implementation requires a large amount of chip area and usually does not offer enough improvement to justify the additional cost. The goal of this work is to propose a new mechanism for approximate computation based on the approximate reuse of functions and code fragments. The technique automatically groups input and output data by similarity and stores this information in a software-controlled memory. Based on these data, the quantized values can be reused through a lookup in this table, from which the most appropriate output is selected, replacing the execution of the original code. Applying this technique is effective, achieving an average 97.1% reduction in energy-delay product (EDP) when compared to neural accelerators.
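The reuse mechanism lends itself to a short sketch: quantize the inputs of an error-tolerant code region, and on a table hit return the stored output instead of re-executing the region. The class name `ApproxReuseTable` and the grid quantization are illustrative; the thesis groups inputs and outputs by similarity (clustering), which this toy version replaces with simple rounding.

```python
import numpy as np

class ApproxReuseTable:
    """Software-managed memoization with lossy (quantized) keys."""

    def __init__(self, func, step=0.1):
        self.func = func    # the expensive code region being approximated
        self.step = step    # quantization granularity: the quality/reuse knob
        self.table = {}     # software-controlled in-memory table

    def __call__(self, *args):
        # Nearby inputs collapse onto the same key, enabling approximate reuse.
        key = tuple(round(a / self.step) for a in args)
        if key not in self.table:       # miss: run the precise code once
            self.table[key] = self.func(*args)
        return self.table[key]          # hit: reuse the stored output

# Usage: wrap an expensive, imprecision-tolerant kernel.
slow_sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
fast_sigmoid = ApproxReuseTable(slow_sigmoid, step=0.05)
print(fast_sigmoid(1.23), fast_sigmoid(1.24))  # second call reuses the first bucket
```

A coarser `step` trades more output noise for a higher hit rate, which is the quality-versus-energy trade-off the abstract describes.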
27

Granular computing approach for intelligent classifier design

Al-Shammaa, Mohammed January 2016 (has links)
Granular computing facilitates dealing with information by providing a theoretical framework in which information is handled as granules at different levels of granularity (different levels of specificity/abstraction). It aims to provide an abstract, explainable description of the data by forming granules that represent the features or the underlying structure of corresponding subsets of the data. In this thesis, a granular computing approach to the design of intelligent classification systems is proposed. The approach is applied to different classification systems to investigate its efficiency: fuzzy inference systems, neural networks, neuro-fuzzy systems, and classifier ensembles are considered. Each of these systems is designed using the proposed approach, and its classification performance is evaluated and compared to that of the standard system. The approach is based on constructing information granules from data at multiple levels of granularity. The granulation process is performed using a modified fuzzy c-means algorithm that takes the classification problem into account. Clustering is followed by a coarsening process that merges small clusters into large ones to form a lower granularity level. The resulting granules are used to build each of the considered binary classifiers in different settings and approaches. Granules produced by the proposed granulation method are used to build a fuzzy classifier for each granulation level or set of levels. The performance of the classifiers is evaluated on real-life data sets and measured by two classification performance measures: accuracy and area under the receiver operating characteristic curve. Experimental results show that fuzzy systems constructed using the proposed method achieve better classification performance. In addition, the proposed approach is used for the design of neural network classifiers: granules from one or more granulation levels are used to train the classifiers at different levels of specificity/abstraction. Using this approach, the classification problem is broken down into the modelling of classification rules represented by the information granules, resulting in a more interpretable system. Experimental results show that neural network classifiers trained using the proposed approach have better classification performance for most of the data sets. In a similar manner, the proposed approach is used for the training of neuro-fuzzy systems, yielding a similar improvement in classification performance. Lastly, neural networks built using the proposed approach are used to construct a classifier ensemble: information granules are used to generate and train the base classifiers, and the final ensemble output is produced by a weighted-sum combiner. Based on the experimental results, the proposed approach improves the classification performance of the base classifiers for most of the data sets. Furthermore, a genetic algorithm is used to determine the combiner weights automatically.
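A compact sketch of the granulation-and-coarsening step may help. Plain fuzzy c-means builds the finest granules, and a coarsening pass merges granules with small crisp support into their nearest large neighbour to form the next granularity level. The thesis uses a modified fuzzy c-means that accounts for the classification problem; that modification is omitted here and the function names are invented, so this is a sketch rather than the thesis' algorithm.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: returns centers and membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # fuzzy memberships
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))           # inverse-distance update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

def coarsen(centers, U, min_size):
    """Merge small granules into their nearest large neighbour.

    Assumes at least one granule meets min_size. Returns a map from each
    granule index to the surviving granule at the coarser level.
    """
    sizes = np.bincount(U.argmax(axis=1), minlength=len(centers))
    large = np.where(sizes >= min_size)[0]
    nearest = large[np.linalg.norm(centers[:, None] - centers[large][None],
                                   axis=2).argmin(axis=1)]
    return np.where(sizes >= min_size, np.arange(len(centers)), nearest)
```

Applying `coarsen` repeatedly with a growing `min_size` yields the sequence of granularity levels from which the classifiers above are built.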
