431 |
Seleção de atributos via agrupamento / Clustering-based feature selection. Covões, Thiago Ferreira. 22 February 2010.
Technological progress has led to the generation and storage of abundant amounts of data. Extracting as much information as possible from such data has required the formulation of new data analysis tools. In this context, the Knowledge Discovery from Databases process was introduced, focused on the identification of valid, new, potentially useful, and comprehensible patterns in large databases. In this process, the task of finding patterns in data is usually called Data Mining. The efficacy and efficiency of data mining algorithms are directly influenced by the amount and quality of the data being analyzed; redundant and/or uninformative features can make the data mining process inefficient, so feature selection methods that remove such features are frequently used. This work proposes a feature selection algorithm, together with some of its variants, that identifies redundant features by clustering the features themselves. The identification of redundant features can favor not only the pattern recognition process but also the comprehensibility of the obtained model. The proposed method and its variants are compared with two feature selection algorithms of the same kind described in the literature, evaluated on two typical data mining problems: classification and clustering. The results show that the proposed algorithm and its variants achieve good accuracy and computational efficiency without requiring the user to define critical parameters.
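The abstract does not detail the algorithm itself, so the following is only a minimal illustrative sketch of the general idea of clustering-based feature selection: group highly correlated features and keep one representative (the medoid) per group. The function name and the correlation threshold are illustrative assumptions, not the thesis's method.

```python
# A minimal sketch (not the thesis algorithm): greedily group features whose
# pairwise absolute correlation exceeds a threshold, then keep one
# representative per group. The threshold value is an assumption.
import numpy as np

def cluster_select_features(X, threshold=0.9):
    """Greedy correlation-based feature clustering; returns selected column indices."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature correlation
    unassigned = set(range(n_features))
    selected = []
    while unassigned:
        seed = unassigned.pop()
        # Cluster = the seed plus all unassigned features highly correlated with it.
        cluster = [seed] + [j for j in list(unassigned) if corr[seed, j] >= threshold]
        for j in cluster[1:]:
            unassigned.discard(j)
        # Representative = medoid: feature with highest mean correlation to its cluster.
        sub = corr[np.ix_(cluster, cluster)]
        selected.append(cluster[int(np.argmax(sub.mean(axis=1)))])
    return sorted(selected)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 3))])  # redundant copies
print(cluster_select_features(X))  # roughly one feature kept per redundant pair
```

Note that, like the algorithms in the thesis, such a sketch needs no labelled data and no user-tuned number of clusters, only a redundancy criterion.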
|
432 |
Um novo algoritmo de clustering para a organização tridimensional de dados de expressão gênica / A new clustering algorithm for tridimensional gene expression data. Lopes, Tiago José da Silva. 29 March 2007.
In this study we developed a new clustering algorithm for gene expression data. Previous approaches use a dataset in the form of a two-dimensional table, where the rows are genes and the columns are experimental conditions. We instead use a three-dimensional structure that adds time slices. We implemented the algorithm and tested it on synthetic and real datasets, using validation indices to compare our results with those obtained by the TriCluster algorithm. The results show that our algorithm performs well on three-dimensional gene expression data and can also be applied to data from other domains.
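As a rough illustration of the three-dimensional layout described above (genes x conditions x time slices), the sketch below builds such a tensor and scores a candidate tricluster by its internal variance, where low variance suggests a coherent block. This is a toy under stated assumptions, not the thesis algorithm or TriCluster.

```python
# A hedged sketch of the 3-D data structure for gene expression over time and a
# naive coherence score for a candidate tricluster; illustration only.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 8, 5))          # 100 genes, 8 conditions, 5 time slices
data[:10, :3, :2] = 5.0 + 0.01 * rng.normal(size=(10, 3, 2))  # planted coherent block

def tricluster_score(tensor, genes, conditions, times):
    """Lower variance inside the subcube = more coherent candidate tricluster."""
    sub = tensor[np.ix_(genes, conditions, times)]
    return float(sub.var())

print(tricluster_score(data, range(10), range(3), range(2)))   # near 0: coherent
print(tricluster_score(data, range(50), range(8), range(5)))   # ~1: background noise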
|
433 |
Um framework para análise de agrupamento baseado na combinação multi-objetivo de algoritmos de agrupamento / A framework for cluster analysis based in the multi-objective combination of clustering algorithms. Faceli, Katti. 08 November 2006.
This thesis presents a framework for exploratory data analysis via clustering techniques, with the goal of facilitating the work of experts in the data domain. The core of the framework is a multi-objective clustering ensemble algorithm, MOCLE, complemented by a method for the integrated visualization of a set of partitions. By combining the ideas of clustering ensembles and multi-objective clustering, MOCLE automatically performs important steps of cluster analysis: it runs several conceptually different clustering algorithms with various parameter configurations, combines the partitions resulting from these algorithms, and selects the partitions with the best trade-offs across different validation measures. MOCLE is a robust approach for dealing with the different types of structure that may be present in a dataset. It yields a concise and stable set of high-quality alternative structures, without requiring prior knowledge about the data or deep expertise in cluster analysis. Furthermore, to facilitate the discovery of more complex structures, MOCLE allows the automatic integration of prior knowledge of a simple structure via its objective functions. Finally, the proposed visualization method allows the simultaneous observation of a set of partitions, which helps in the analysis of MOCLE's results.
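One concrete step named above, selecting the partitions with the best trade-offs across validation measures, amounts to keeping a Pareto front of candidate partitions. The sketch below illustrates that selection with made-up scores; it is not MOCLE itself, and the scores and names are assumptions.

```python
# A minimal Pareto-front selection over candidate partitions, each scored by two
# cluster-validation measures (both to maximize). Scores here are invented.
def pareto_front(scores):
    """scores: list of (measure1, measure2) tuples; returns indices of non-dominated points."""
    front = []
    for i, a in enumerate(scores):
        dominated = any(
            b[0] >= a[0] and b[1] >= a[1] and b != a
            for j, b in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

partition_scores = [(0.8, 0.3), (0.6, 0.6), (0.3, 0.9), (0.5, 0.5), (0.2, 0.2)]
print(pareto_front(partition_scores))  # [0, 1, 2]: only the best trade-offs survive
```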
|
434 |
Measuring Data Abstraction Quality in Multiresolution Visualizations. Cui, Qingguang. 11 April 2007.
Data abstraction techniques are widely used in multiresolution visualization systems to reduce visual clutter and facilitate analysis from overview to detail. However, analysts are usually unaware of how well the abstracted data represent the original dataset, which can affect the reliability of results gleaned from the abstractions. In this thesis, we define three types of data abstraction quality measures that compute the degree to which an abstraction conveys the original dataset: the Histogram Difference Measure, the Nearest Neighbor Measure, and the Statistical Measure. They have been integrated within XmdvTool, a public-domain multiresolution visualization system for multivariate data analysis that supports both sampling and clustering to simplify data. Several interactive operations are provided, including adjusting the data abstraction level, changing selected regions, and setting the acceptable data abstraction quality level. Using these operations, analysts can select an optimal data abstraction level. We conducted an evaluation to check how well the data abstraction measures conform to the abstraction quality perceived by users, and adjusted the measures based on its results. We also experimented with different distance methods and computing mechanisms in order to find the optimal variation of each type of measure. Finally, we developed two case studies demonstrating how analysts can use the measures to compare abstraction methods, assess how well relative data density and outliers are maintained, and select an abstraction method that meets the requirements of their analytic tasks.
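As a rough illustration of the first measure named above, the sketch below compares the histogram of an abstraction (here, a random sample) against that of the original data. The binning, the total-variation normalization, and the function name are assumptions, not XmdvTool's exact implementation.

```python
# A hedged sketch in the spirit of a Histogram Difference Measure: 1 means the
# abstraction's distribution matches the original perfectly on this binning.
import numpy as np

def histogram_difference(original, abstracted, bins=20):
    lo, hi = original.min(), original.max()
    h_orig, _ = np.histogram(original, bins=bins, range=(lo, hi), density=True)
    h_abs, _ = np.histogram(abstracted, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    # Total variation distance between the two normalized histograms, in [0, 1].
    tv = 0.5 * np.sum(np.abs(h_orig - h_abs)) * width
    return 1.0 - tv

rng = np.random.default_rng(2)
data = rng.normal(size=10_000)
sample = rng.choice(data, size=500, replace=False)   # abstraction by sampling
print(round(histogram_difference(data, sample), 3))  # close to 1: good abstraction
```

In an interactive setting like the one described, an analyst could lower the abstraction level (take larger samples) until this score crosses an acceptable threshold.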
|
435 |
Cluster heads selection and cooperative nodes selection for cluster-based Internet of Things networks. Song, Liumeng. January 2017.
Clustering and cooperative transmission are key enablers in power-constrained Internet of Things (IoT) networks, where the challenge is to reduce energy consumption while guaranteeing Quality of Service (QoS) provision. In this thesis, optimal node selection algorithms based on clustering and cooperative communication are proposed for different network scenarios, in particular:

• A QoS-aware, energy-efficient cluster head (CH) selection algorithm for one-hop capillary networks. This algorithm selects the optimal set of CHs and constructs clusters accordingly, based on the location and residual energy of devices (a toy sketch of this idea follows this abstract).

• Cooperative node selection algorithms for cluster-based capillary networks. Exploiting the spatial diversity of cooperative communication, these algorithms select the optimal set of cooperative nodes to assist the CHs with the long-haul transmission. In addition, with regard to even energy distribution in one-hop cluster-based capillary networks, CH selection is taken into consideration when developing the cooperative device selection algorithms.

The performance of the proposed selection algorithms is evaluated via comprehensive simulations. Simulation results show that the proposed algorithms can extend network lifetime by up to 20% and reduce the overall packet error rate (PER) by up to 50%. Furthermore, the results also show that an optimal trade-off between energy efficiency and QoS provision can be achieved in both one-hop and multi-hop cluster-based scenarios.
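As referenced in the first bullet above, here is a toy sketch of location- and energy-aware cluster-head selection: devices are scored by a weighted combination of residual energy and closeness to the field centroid, and the top scorers become CHs. The scoring weights and function names are illustrative assumptions, not the thesis's QoS-aware algorithm.

```python
# A hedged sketch of energy- and location-aware cluster-head selection for a
# one-hop IoT field; the weighted score is an assumption for illustration.
import numpy as np

def select_cluster_heads(positions, residual_energy, n_heads=3, alpha=0.7):
    """Score = alpha * normalized energy + (1 - alpha) * closeness to the centroid."""
    centroid = positions.mean(axis=0)
    dist = np.linalg.norm(positions - centroid, axis=1)
    closeness = 1.0 - dist / dist.max()
    energy = residual_energy / residual_energy.max()
    score = alpha * energy + (1 - alpha) * closeness
    return np.argsort(score)[-n_heads:]          # indices of the best-scoring devices

rng = np.random.default_rng(3)
pos = rng.uniform(0, 100, size=(30, 2))          # 30 IoT devices on a 100x100 field
energy = rng.uniform(0.2, 1.0, size=30)          # residual battery levels
heads = select_cluster_heads(pos, energy)
# Each remaining device joins its nearest cluster head.
members = np.argmin(np.linalg.norm(pos[:, None] - pos[heads][None], axis=2), axis=1)
print(heads, members[:10])
```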
|
436 |
Application of image segmentation in inspection of welding: Practical research in MATLAB. Shen, Jiannan. January 2012.
As one of the main joining methods in modern steel production, welding plays a very important role in the national economy and is widely applied in fields such as aviation, petroleum, chemicals, electricity and railways. The craft of welding can be improved in terms of welding tools, welding technology and welding inspection; however, welding inspection remains a complicated problem. Effectively detecting internal welding defects in welded structures is therefore important and worth further study.

The main task of this thesis is to research the application of image segmentation to welding inspection. It introduces image enhancement and image segmentation techniques, covering image conversion and noise removal as well as thresholding, clustering, edge detection and region extraction. Based on the MATLAB platform, it focuses on applying image segmentation to radiographic inspection of steel structures and examines three segmentation methods: thresholding, clustering and edge detection.

Image segmentation proved more effective than image enhancement for this task because:

1. Gray-scale-based FCM clustering performs well: it separates pixels by grey level, so the grey values reveal the position and depth of the related defects (sketched in the code after this abstract).

2. Canny edge detection is also fast and performs well, giving enough detail around edges and defects, with smooth contour lines.

3. Image enhancement can only improve image quality (clarity and contrast) and provides no further information for detecting welding defects.

This work stems from the actual needs of industrial practice and proves practical to some extent. It also indicates directions for future work, including identification of welding defects based on neural networks and improved clustering algorithms based on genetic approaches. / Program: Magisterutbildning i informatik
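The thesis works in MATLAB; as a language-neutral illustration of point 1 above, here is a small NumPy sketch of gray-level fuzzy c-means (FCM): pixel intensities are clustered into c classes, and hard labels are taken from the maximum membership. The toy "radiograph" data are an assumption for demonstration.

```python
# A hedged Python sketch of gray-level fuzzy c-means for segmentation; a toy
# illustration, not the thesis's MATLAB implementation.
import numpy as np

def fcm_gray(pixels, c=3, m=2.0, iters=50, eps=1e-6):
    """Fuzzy c-means on a 1-D array of gray values; returns centers and memberships."""
    rng = np.random.default_rng(0)
    centers = rng.choice(pixels, size=c).astype(float)
    for _ in range(iters):
        dist = np.abs(pixels[None, :] - centers[:, None]) + eps   # (c, n) distances
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)                  # fuzzy memberships
        um = u ** m
        centers = (um @ pixels) / um.sum(axis=1)                  # weighted means
    return centers, u

# Toy "radiograph": background, weld metal, and a bright third region.
img = np.concatenate([np.full(500, 40.0), np.full(400, 120.0), np.full(100, 220.0)])
img += np.random.default_rng(4).normal(0, 5, img.size)
centers, u = fcm_gray(img)
labels = u.argmax(axis=0)            # hard segmentation from fuzzy memberships
print(np.sort(centers).round(1))     # ~[40, 120, 220]: the three gray levels found
```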
|
437 |
Hypothesis formulation in medical records space. Ba-Dhfari, Thamer Omer Faraj. January 2017.
Patient medical records are a valuable resource that can be used for many purposes, including managing and planning for future health needs as well as clinical research. Health databases such as the Clinical Practice Research Datalink (CPRD) and many similar initiatives can provide researchers with a useful data source on which to test their medical hypotheses. However, this is only the case when researchers have a good set of hypotheses to test; the data may have other equally important areas that remain unexplored, and there is a chance that some important signals could be missed. Further analysis is therefore required to make such hidden areas more visible and attainable for future exploration and investigation. Data mining techniques can be effective tools for discovering patterns and signals in large-scale patient datasets and have been widely applied in the medical domain, so analysing patient data with such techniques has the potential to provide a better understanding of the information in patient records. However, the heterogeneity and complexity of medical data can be an obstacle to applying data mining techniques, and much of the potential value of these data therefore goes untapped.

This thesis describes a novel methodology that reduces the dimensionality of primary care data to make it more amenable to visualisation, mining and clustering. The methodology employs a combination of ontology-based semantic similarity and principal component analysis (PCA) to map the data into an appropriate and informative low-dimensional space, providing a visualisation of patient records. This visualisation offers a systematic method for formulating new and testable hypotheses, which can be fed to researchers to carry out the subsequent phases of research.

In a small-scale study based on Salford Integrated Record (SIR) data, I demonstrated that this mapping provides informative views of patient phenotypes across a population and allows the construction of clusters of patients sharing common diagnoses and treatments. The next phase of the research was to develop the methodology and explore its application on larger patient cohorts, whose data contain more precise relationships between features and support the identification of distinct population patterns and the extraction of common features. For these reasons, I applied the mapping methodology to patient records from the CPRD database, a dataset of anonymised records for a population of 2.7 million patients. This analysis showed that the methodology scales as O(n) and does not require large computing resources. The low-dimensional visualisation of high-dimensional patient data allowed the identification of different subpopulations across the study dataset, each consisting of patients sharing similar characteristics such as age, gender and certain types of disease.

A key finding of this research is the wealth of data that can be produced. In the first use case, the stratification of patients with falls, the methodology yielded important hypotheses; however, this work has barely scratched the surface of how the mapping could be used. It opens up the possibility of applying a wide range of data mining strategies that have not yet been explored: the thesis shows one strategy that works, but there could be many more. Furthermore, no aspect of the implementation restricts the methodology to medical data; it could equally be applied to the analysis and visualisation of many other data sources described using terms from taxonomies or ontologies.
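The mapping step described above (pairwise semantic similarity followed by a PCA-style projection) can be illustrated roughly as below: given a patient-by-patient similarity matrix, convert it to distances and embed with classical multidimensional scaling, which is an eigendecomposition of the double-centered distance matrix. The random block similarity matrix here stands in for real ontology-based semantic similarities; all names are illustrative assumptions.

```python
# A hedged sketch of embedding patients in 2-D from a similarity matrix via
# classical MDS; illustration only, not the thesis pipeline.
import numpy as np

def embed_2d(similarity):
    """Double-center a squared-distance matrix and take the top two eigenvectors."""
    d2 = 2.0 * (1.0 - similarity)                  # similarity in [0,1] -> squared distance
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ d2 @ J                          # double centering
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[-2:][::-1]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

rng = np.random.default_rng(5)
# Two synthetic subpopulations: high within-group similarity, lower across groups.
S = np.full((40, 40), 0.2) + 0.1 * rng.uniform(size=(40, 40))
S[:20, :20] = S[20:, 20:] = 0.9
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
coords = embed_2d(S)
print(coords[:3].round(2), coords[-3:].round(2))  # the two groups separate in 2-D
```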
|
438 |
Eye tracking scanpath trend analysis on Web pages. Eraslan, Sukru. January 2016.
Web pages are typically composed of different kinds of visual elements, such as menus, headers and footers. To improve user experience, eye tracking has been widely used to investigate how users interact with such elements. In particular, eye movement sequences, called scanpaths, have been analysed to understand the path that people follow in terms of these elements. However, individual scanpaths are typically complicated and specific to particular users, so any processing done with them will be specific to individuals and not representative of multiple users. Scanpaths should therefore be clustered to provide the general direction followed by users; this direction allows researchers to better understand user interactions with web pages and then improve page design accordingly. Existing research tends to produce very short scanpaths that are not representative enough for understanding user behaviour. This thesis introduces a new algorithm for clustering scanpaths, called Scanpath Trend Analysis (STA). In contrast to existing work, STA includes in the resulting scanpath any element that, although not shared by all users, receives at least the same attention as the fully shared elements. The algorithm thus provides a richer understanding of how users interact with web pages. The STA algorithm was evaluated in a series of eye tracking studies in which the web pages were automatically segmented into their visual elements using different approaches. The results show that the outputs of the STA algorithm are significantly more similar to the input scanpaths than the outputs of other existing work, and that this is not limited to a particular segmentation approach. The effect of the number of users on the STA algorithm was also investigated, as the number of users required for scanpath analysis has not been studied in depth in the literature; the results show that the same results can be reached with a smaller group of users. The research presented in this thesis should be of value to eye tracking researchers, to whom the STA algorithm has been made available for analysing scanpaths, and to behaviour analysis researchers, who can use the algorithm to understand user behaviour on web pages and then design, develop and present pages accordingly.
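Based on the description above, here is a deliberately simplified sketch of the trending-element rule: an element joins the trend scanpath if all users fixated it, or if its total attention (fixation count) is at least that of the least-attended fully shared element; elements are then ordered by mean first-visit position. The published STA algorithm differs in detail; this is an approximation under stated assumptions.

```python
# A simplified sketch of the trending-element idea behind STA; not the exact
# published algorithm. Scanpaths are sequences of visual-element identifiers.
from collections import Counter

def trend_scanpath(scanpaths):
    n_users = len(scanpaths)
    counts = Counter(el for path in scanpaths for el in path)
    shared = [el for el in counts if all(el in p for p in scanpaths)]
    threshold = min(counts[el] for el in shared) if shared else max(counts.values())
    trending = [el for el in counts if el in shared or counts[el] >= threshold]

    def mean_first_pos(el):
        positions = [p.index(el) for p in scanpaths if el in p]
        return sum(positions) / len(positions)

    return sorted(trending, key=mean_first_pos)

paths = [list("MHCAF"), list("MHCF"), list("MCHAAF")]  # M=menu, H=header, etc.
print(trend_scanpath(paths))  # ['M', 'H', 'C', 'A', 'F']: 'A' trends via attention
```

Note how 'A' is missing from one user's scanpath but still enters the trend because its total attention matches that of the fully shared elements, which is the key contrast with earlier common-subsequence approaches.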
|
439 |
Semi-supervised document clustering with active learning. CUHK electronic theses & dissertations collection. January 2008.
This thesis presents a new framework for automatically partitioning text documents that takes into consideration constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Unlike traditional semi-supervised document clustering approaches, which assume pairwise constraints are prepared by the user beforehand, we develop a novel framework for automatically discovering pairwise constraints that reveal the user's grouping preference. An active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments on document pairs. For this purpose, three models are designed for measuring the informativeness of document pairs from different perspectives: an uncertainty model, a generation error model, and a term-to-term relationship model. A dependent active learning approach extends the active learning approach to avoid redundant document pair selection; two models are investigated for estimating the likelihood that a document pair is redundant with respect to previously selected pairs, namely a KL divergence model and a symmetric model.

Most existing semi-supervised document clustering approaches are model-based and can be treated as parametric models that assume the underlying clusters follow a certain pre-defined distribution. In our semi-supervised document clustering, each cluster is instead represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints into the document clustering: the term-to-term relationship approach (TR), which uses pairwise constraints to capture term-to-term dependence relationships, and the linear combination approach (LC), which combines the clustering objective function with the user-provided constraints linearly. Extensive experimental results show that the proposed framework is effective.

Huang, Ruizhang. Adviser: Wai Lam. Thesis (Ph.D.), Chinese University of Hong Kong, 2008. Includes bibliographical references (leaves 117-123). Abstracts in English and Chinese.
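The linear combination (LC) idea described above can be sketched in simplified form: score a candidate clustering as a distortion term plus a weighted count of violated pairwise constraints. The k-means-style distortion and the weight lambda are assumptions for illustration; the thesis combines constraints with a non-parametric cluster model, not this one.

```python
# A hedged sketch of a linear combination of clustering objective and pairwise
# constraints; lower scores are better. Illustration only.
import numpy as np

def lc_objective(X, labels, must_link, cannot_link, lam=1.0):
    distortion = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        distortion += float(((pts - pts.mean(axis=0)) ** 2).sum())
    violations = sum(labels[i] != labels[j] for i, j in must_link)
    violations += sum(labels[i] == labels[j] for i, j in cannot_link)
    return distortion + lam * violations

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
good = np.array([0] * 10 + [1] * 10)
bad = good.copy()
bad[0] = 1                                          # violates the must-link below
ml, cl = [(0, 1)], [(0, 15)]
print(lc_objective(X, good, ml, cl), lc_objective(X, bad, ml, cl))  # good < bad
```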
|
440 |
Some new developments for quantile regression. Liu, Xi. January 2018.
Quantile regression (QR) (Koenker and Bassett, 1978), as a comprehensive extension of standard mean regression, has been steadily developed from both theoretical and applied perspectives. Bayesian quantile regression (BQR), which deals with unknown parameter estimation and model uncertainty, is a newly proposed QR tool. This thesis makes novel contributions to three issues related to QR. First, whereas QR for continuous responses has received much attention in the literature, QR for discrete responses has received far less. Second, conventional QR methods often suffer from crossing quantile curves, which imply an invalid distribution for the response: given a set of covariates, it may turn out, for example, that the predicted 95th percentile of the response is smaller than the 90th percentile for some covariate values. Third, mean-based clustering methods are widely developed but need improvement for clustering extreme-type or heavy-tailed data and for handling outliers. This thesis focuses on methods developed for these three challenges: modelling quantile regression with discrete responses, ensuring non-crossing quantile curves for any given sample, and modelling tails for collinear data with outliers. The main contributions are as follows (a toy illustration of the ALD/check-loss link that underpins BQR appears after this abstract):

* The first challenge is studied in Chapter 2, which develops a general method for Bayesian inference of regression models beyond the mean with discrete responses, covering both Bayesian quantile regression and Bayesian expectile regression. The method provides a direct Bayesian approach to these models with a simple and intuitive interpretation of the regression results. The posterior distribution under this approach is shown to be not only coherent with the response variable, irrespective of its true distribution, but also proper in relation to improper priors for the unknown model parameters.

* Chapter 3 investigates a new kernel-weighted likelihood smoothing quantile regression method. The likelihood is based on a normal scale-mixture representation of the asymmetric Laplace distribution (ALD). This approach enjoys the same good design adaptation as local quantile regression (Spokoiny et al., 2014) and ensures non-crossing quantile curves for any given sample.

* Chapter 4 introduces an asymmetric Laplace distribution to model the response variable using profile regression, a Bayesian non-parametric model for clustering responses and covariates simultaneously. This development allows more accurate modelling of asymmetric clusters and more accurate prediction of extreme values of the response variable and/or outliers.

Beyond these three major challenges, the thesis also addresses other important issues, such as smoothing extreme quantile curves and remaining insensitive to heteroscedastic errors and outliers in the response variable. The performance of all three developments is evaluated via both simulation studies and real data analysis.
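As referenced above, here is the standard link between the asymmetric Laplace likelihood and quantile estimation: maximizing an ALD likelihood at quantile tau is equivalent to minimizing the pinball (check) loss. The sketch fits a linear tau-th quantile by subgradient descent on that loss; it is a textbook illustration under stated assumptions, not the thesis's Bayesian machinery, and the step size and iteration count are arbitrary choices.

```python
# A minimal sketch of linear quantile regression via the check (pinball) loss,
# the frequentist counterpart of the ALD likelihood used in BQR.
import numpy as np

def fit_quantile(X, y, tau, lr=0.01, iters=5000):
    Xb = np.column_stack([np.ones(len(y)), X])     # add intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        r = y - Xb @ beta
        # Subgradient of the check loss: weight tau above the fit, tau-1 below.
        grad = -Xb.T @ np.where(r >= 0, tau, tau - 1) / len(y)
        beta -= lr * grad
    return beta

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 500)
y = 1.0 + 2.0 * x + rng.normal(0, 1 + 0.3 * x)     # heteroscedastic noise
for tau in (0.1, 0.5, 0.9):
    print(tau, fit_quantile(x, y, tau).round(2))   # slopes fan out across quantiles
```

Because the noise scale grows with x, the fitted slopes increase with tau; nothing in this naive independent-fit scheme prevents the crossing problem the thesis addresses, which is exactly why the non-crossing construction of Chapter 3 matters.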
|