351 |
Algoritmos e técnicas de validação em agrupamento de dados multi-representados, agrupamento possibilístico e bi-agrupamento / Algorithms and validation techniques in multi-represented data clustering, possibilistic clustering and bi-clusteringDanilo Horta 25 November 2013 (has links)
Existem bases para as quais os dados são naturalmente representados por mais de uma visão. Por exemplo, imagens podem ser descritas por atributos de cores, textura e forma. Proteínas podem ser caracterizadas pela sequência de aminoácidos e pela representação tridimensional. A unificação das diferentes visões de uma base de dados pode ser problemática porque elas podem não ser comparáveis entre si ou podem apresentar diferentes graus de importância. Esses graus de importância podem, inclusive, se manifestar de maneira local, de acordo com a subestrutura dos dados em questão. Isso motivou o surgimento de algoritmos de agrupamento de dados capazes de lidar com bases multi-representadas (i.e., que possuem mais de uma visão dos dados), como o algoritmo SCAD. Esse algoritmo se mostrou promissor em experimentos relatados na literatura, mas possui problemas críticos identificados neste trabalho que o impedem de funcionar em determinados cenários. Tais problemas foram solucionados por meio da proposição de uma nova versão do algoritmo, denominada ASCAD, fundamentada em provas formais sobre a sua convergência. Foram desenvolvidas versões relacionais do algoritmo ASCAD, capazes de lidar com bases descritas apenas por relações de proximidade entre os objetos. Foi desenvolvido também um índice de validação interna e relativa de agrupamento voltado para dados multi-representados. A avaliação de agrupamento possibilístico e de bi-agrupamento por meio da comparação entre solução encontrada e solução de referência (validação externa) também foi explorada. Algoritmos de bi-agrupamento têm ganhado um interesse crescente da comunidade de análise de expressão gênica. No entanto, pouco se conhece do comportamento e das propriedades das medidas voltadas para validação externa de bi-agrupamento, o que motivou uma análise teórica e empírica dessas medidas. Essa análise mostrou que a maioria das medidas de biagrupamento possui problemas críticos e destacou duas delas como sendo as mais promissoras. Foram inclusas nessa análise três medidas de agrupamento particional não exclusivo, cujo uso na comparação de bi-agrupamentos é possível por meio de uma nova abordagem de avaliação de bi-agrupamento proposta nesta tese. Agrupamento particional não exclusivo faz parte de um domínio mais geral de soluções, i.e., o domínio dos agrupamentos possibilísticos. Observou-se algumas falhas conceituais importantes das medidas de agrupamento possibilístico, o que motivou o desenvolvimento de novas medidas e de uma análise empírica e conceitual envolvendo 34 medidas. Uma das medidas propostas se destacou como sendo a única que apresentou avaliações imparciais com relação ao número de grupos, o valor máximo de similaridade ao comparar a solução ideal encontrada com a solução de referência e avaliações sensíveis às diferenças das soluções em todos os cenários considerados / There are data sets for which the instances are naturally represented by more than one view. For example, images can be described by attributes of color, texture, and shape. Proteins can be characterized by the amino acid sequence and by their three-dimensional description. The unification of different views of a data set can be problematic because they may not be comparable or may have different degrees of importance. These degrees of importance may even manifest itself locally, according to the data substructures. This prompted the emergence of clustering algorithms capable of handling multi-represented data sets (i.e., data sets having more than one view) as the SCAD algorithm. This algorithm has shown promising results in experiments reported in the literature, but it has critical problems identified in this work that hinder its application in certain scenarios. These problems were solved here by proposing a new version of the algorithm, called ASCAD, based on formal proofs about its correctness. We developed relational versions for ASCAD, capable of handling data sets described only by the proximities between the instances. We also developed an index for internal and relative validation of multi-represented data clusterings. The evaluation of possibilistic clustering and bi-clustering by comparing the found and reference solutions (external validation) was also explored. Bi-clustering algorithms have gained increasing interest from the community of gene expression analysis. However, little is known of the behavior and properties of the measures aimed at external validation of bi-clustering, which motivated a theoretical and empirical analysis of these measures in this work. This analysis showed that most bi-clustering measures has critical issues and highlighted two of the measures as being the most promising. We included in this analysis three measures of non-exclusive partitional clustering, whose use in comparing bi-clusterings is possible through a new approach proposed in this thesis. Non-exclusive partitional clustering belong to a more general domain of solutions, i.e., the domain of possibilistic clusterings. There are some important conceptual flaws in the measures of possibilistic clustering, which motivated us to develop new measures and to conceptually and empirically analyse 34 measures. One of the proposed measures stood out as being the one who presented unbiased evaluations regarding the number of clusters, the maximum similarity when comparing the optimal solution with the reference one, and evaluations sensitive to solution differences in all scenarios considered
|
352 |
Étude exhaustive de voies de signalisation de grande taille par clustering des trajectoires et caractérisation par analyse sémantique / Comprehensive study of large signaling pathways by clustering trajectories and characterization by semantic analysisCoquet, Jean 20 December 2017 (has links)
Les voies de signalisation décrivent les réponses d'une cellule à des stimuli externes. Elles sont primordiales dans les processus biologiques tels que la différentiation, la prolifération ou encore l'apoptose. La biologie des systèmes tentent d'étudier ces voies de façon exhaustive à partir de modèles statistiques ou dynamiques. Le nombre de solutions expliquant un phénomène biologique (par exemple la réaction d'une cellule à un stimulus) peut être très élevé dans le cas de grands modèles. Cette thèse propose, dans un premier temps, différentes stratégies de regroupement de ces solutions à partir de méthodes de clustering et d'analyse de concepts formels. Puis elle présente la caractérisation de ces regroupements à partir de web sémantique. Ces stratégies ont été appliquées au réseau de signalisation du TGF-beta, un stimulus extra-cellulaire jouant un rôle important dans le développement du cancer, ce qui a permis d'identifier cinq grands groupes de trajectoires participant chacun à des processus biologiques différents. Dans un second temps, cette thèse se confronte au problème de conversion des données hétérogènes provenant de différentes bases dans un formalisme unique afin de pouvoir généraliser l'étude précédente. Elle propose une stratégie permettant de regrouper les différents réseaux de signalisation provenant d'une base de données en un modèle unique et ainsi permettant de calculer toutes les trajectoires de signalisation d'un stimulus. / Signaling pathways describe the extern stimuli responses of a cell. They are indispensable in biological processes such as differentiation, proliferation or apoptosis. The Systems Biology tries to study exhaustively the signalling pathways using static or dynamic models. The number of solutions which explain a biological phenomenon (for example the stimulus reaction of cell) can be very high in large models. First, this thesis proposes some different strategies to group the solutions describing the stimulus signalling with clustering methods and Formal Concept Analysis. Then, it presents the cluster characterization with semantic web methods. Those strategies have been applied to the TGF-beta signaling network, an extracellular stimulus playing an important role in the cancer growing, which helped to identify 5 large groups of trajectories characterized by different biological processes. Next, this thesis confronts the problem of heterogeneous data translation from different bases to a unique formalism. The goal is to be able to generalize the previous study. It proposes a strategy to group signaling pathways of a database to an unique model, then to calculate every signaling trajectory of the stimulus.
|
353 |
Découverte et exploitation de proportions analogiques dans les bases de données relationnelles / Discovering and exploiting analogical proportions in a relational database contextCorrea Beltran, William 18 July 2016 (has links)
Dans cette thèse, nous nous intéressons aux proportions analogiques dans le contexte des bases de données relationnelles. Les proportions analogiques permettent de lier quatre éléments dans une relation du type ''A est à B ce que C est à D''. Par exemple, « Paris est à la France ce que Rome est à l'Italie ». Nous avons étudié le problème de la prédiction de valeurs manquantes dans une base de données en utilisant les proportions analogiques. Un algorithme de classification fondé sur les proportions analogiques a été modifié afin de résoudre ce problème. Puis, nous avons étudié les propriétés des éléments appartenant à l'ensemble d'apprentissage des classificateurs analogiques fréquemment exploités pour calculer la prédiction. Ceci nous a permis de réduire considérablement la taille de cet ensemble par élimination des éléments peu pertinents et par conséquent, de diminuer les temps d'exécution de ces classificateurs. La deuxième partie de la thèse a pour objectif de découvrir de nouveaux patrons basés sur la relation d'analogie, i.e., des parallèles, dans les bases de données. Nous avons montré qu'il est possible d'extraire ces patrons en s'appuyant sur des approches de clustering. Les clusters produits par de telles techniques présentent aussi un intérêt pour l'évaluation de requêtes recherchant des patrons d'analogie dans les bases de données. Dans cette perspective, nous avons proposé d'étendre le langage de requêtes SQL pour pouvoir trouver des quadruplets d'une base de données satisfaisant une proportion analogique. Nous avons proposé différentes stratégies d'évaluation pour de telles requêtes, et avons comparé expérimentalementleurs performances. / In this thesis, we are interested in the notion of analogical proportions in a relational database context. An analogical proportion is a statement of the form “A is to B as C is to D”, expressing that the relation beween A and B is the same as the relation between C and D. For instance, one may say that “Paris is to France as Rome is to Italy”. We studied the problem of imputing missing values in a relational database by means of analogical proportions. A classification algorithm based on analogical proportions has been modified in order to impute missing values. Then, we studied how analogical classifiers work in order to see if their processing could be simplified. We showed how some typeof analogical proportions is more useful than the others when performing classification. We then proposed an algorithm using this information, which allowed us to considerably reduce the size of the training set used by the analogical classificationalgorithm, and hence to reduce its execution time. In the second part of this thesis, we payed a particular attention to the mining of combinations of four tuples bound by an analogical relationship. For doing so, we used several clustering algorithms, and we proposed some modifications to them, in order tomake each obtained cluster represent a set of analogical proportions. Using the results of the clustering algorithms, we studied how to efficiently retrieve the analogical proportions in a database by means of queries. For doing so, we proposed to extend the SQL query language in order to retrieve from a database the quadruples of tuples satisfying an analogical proportion. We proposed severalquery evaluation strategies and experimentally compared their performances.
|
354 |
Diagrammes d’Euler pour la visualisation de communautés et d’ensembles chevauchants / Visualisation of overlapping sets and clusters with Euler diagramsSimonetto, Paolo 02 December 2011 (has links)
Dans cette thèse, nous proposons une méthode pour la visualisation d'ensembles chevauchant et de basé sur les diagrammes d'Euler. Les diagrammes d'Euler sont probablement les plus intuitifs pour représenter de manière schématique les ensembles qui partagent des éléments. Cette métaphore visuelle est ainsi un outil puissant en termes de visualisation d'information. Cependant, la génération automatique de ces diagrammes présente encore de nombreux problèmes difficiles. Premièrement, tous les clustering chevauchants ne peuvent pas être dessinées avec les diagrammes d'Euler classiques. Deuxièmement, la plupart des algorithmes existants permettent uniquement de représenter les diagrammes de dimensions modestes. Troisièmement, les besoins des applications réelles requièrent un processus plus fiable et plus rapide.Dans cette thèse, nous décrivons une version étendue des diagrammes d'Euler. Cette extension permet de modéliser l'ensemble des instances de la classe des clustering chevauchants. Nous proposons ensuite un algorithme automatique de génération de cette extension des diagrammes d'Euler. Enfin, nous présentons une implémentation logicielle et des expérimentations de ce nouvel algorithme. / In this thesis, we propose a method for the visualisation of overlapping sets and of fuzzy graph clusterings based on Euler diagrams.Euler diagrams are probably the most intuitive and most used method to depict sets in which elements can be shared. Such a powerful visualisation metaphor could be an invaluable visualisation tool, but the automatic generation of Euler diagrams still presents many challenging problems. First, not all instances can be drawn using standard Euler diagrams. Second, most existing algorithms focus on diagrams of modest dimensions while real-world applications typically features much larger data. Third, the generation process must be reliable and reasonably fast.In this thesis, we describe an extended version of Euler diagrams that can be produced for every input instance. We then propose an automatic procedure for the generation of such diagrams that specifically target large input instances. Finally, we present a software implementation of this method and we describe some output examples generated on real-world data.
|
355 |
Visualização, kernels e subespaços: um estudo prático / Visualization, kernels and subspace: a practical studyAdriano Oliveira Barbosa 16 December 2016 (has links)
Dados de alta dimensão são tipicamente tratados como pertencentes a um único subespaço do espaço onde estão imersos. Entretanto, dados utilizados em aplicações reais estão usualmente distribuídos entre subespaços independentes e com dimensões distintas. Um objeto de estudo surge a partir dessa afirmação: como essa distribuição em subespaços independentes pode auxiliar tarefas de visualização? Por outro lado, se o dado parece estar embaralhado nesse espaço de alta dimensão, como visualizar seus padrões e realizar tarefas como classificação? Podemos, por exemplo, mapear esse dado num outro espaço utilizando uma função capaz de o desembaralhar, de modo que os padrões intrínsecos fiquem mais claros e, assim, facilitando nossa tarefa de visualização ou classificação. Essa Tese apresenta dois estudos que abordam ambos os problemas. Para o primeiro, utilizamos técnicas de subspace clustering para definir, quando existente, a estrutura de subespaços do dado e estudamos como essa informação pode auxiliar em visualizações utilizando projeções multidimensionais. Para o segundo problema, métodos de kernel, bastante conhecidos na literatura, são as ferramentas a nos auxiliar. Utilizamos a medida de similaridade do kernel para desenvolver uma nova técnica de projeção multidimensional capaz de lidar com dados imersos no espaço de características induzido implicitamente pelo kernel. / High-dimensional data are typically handled as laying in a single subspace of the original space. However, data involved in real applications are usually spread around in distinct subspaces which may have different dimensions. We would like to study how the subspace structure information can be used to improve visualization tasks. On the other hand, what if the data is tangled in this high-dimensional space, how to visualize its patterns or how to accomplish classification tasks? One could, for example, map the data in another high-dimensional space using amapping capable of untangle the data making the patterns clear, rendering the visualization or classification an easy task. This dissertation presents an study for both problems pointed out above. For the former, we use subspace clustering techniques to define, when it exists, a subspace structure, studying how this information can be used to support visualization tasks based on multidimensional projections. For the latter problem we employ kernel methods, well known in the literature, as a tool to assist visualization tasks. We use a similarity measure given by the kernel to develop acompletely new multidimensional projection technique capable of dealing with data embedded in the implicit feature space defined by the kernel.
|
356 |
Intégration des facteurs prédictifs de l'effet d'un traitement dans la conception et l'analyse des essais cliniques de petite taille : application à la maladie de Huntington. / Integration of predictive factors of treatment effect in design and analyse of clinical trials with small sample size : application on Huntington's diseaseSchramm, Catherine 06 July 2016 (has links)
La maladie de Huntington est neurodégénérative, génétique, rare, multifacette et de durée d'évolution longue, induisant une grande. Les biothérapies en cours d'essai sont réalisées sur des petits effectifs, avec un effet mesurable à long terme et hétérogène. Identifier des marqueurs d'évolution de la maladie et de réponse au traitement permettrait de mieux comprendre et d'améliorer les résultats des futurs essais cliniques. Nous avons développé une méthode de clustering pour l'efficacité d'un traitement dans le cadre de données longitudinales afin de définir des répondeurs et non répondeurs au traitement. Notre méthode, robuste pour les petits effectifs, combine un modèle linéaire mixte à deux pentes et un algorithme de clustering. Le modèle mixte génère des effets aléatoires, associés à la réponse au traitement, propres à chaque patient. L'algorithme de clustering permet de définir des sous-groupes selon la valeur des effets aléatoires. Trouver des sous-groupes de patients répondeurs permet de définir des marqueurs prédictifs de la réponse au traitement qui seront utilisés pour donner le traitement le mieux adapté à chaque patient. Nous avons discuté de l'intégration (i) des marqueurs prédictifs dans les plans expérimentaux des essais cliniques, en évaluant leur impact sur la puissance de l'étude; et (ii) des marqueurs pronostiques, en étudiant l¿impact du polymorphisme COMT sur le déclin cognitif des patients. Enfin, nous avons évalué l'effet d'apprentissage des tests neuropsychologiques, et montré comment une double évaluation à l'inclusion dans un essai clinique permettait de s'en affranchir quand le critère de jugement principal est le déclin cognitif. / Huntington's disease is neurodegenerative, genetic, rare, multifaceted and has a long evolution, inducing heterogeneity of conditions and progression of the disease. Current biotherapy trials are performed on small samples of patients, with a treatment effect measurable in the long-term that is heterogeneous. Identifying markers of the disease progression and of the treatment response may help to better understand and improve results of biotherapy studies in Huntington's disease. We have developed a clustering method for the treatment efficacy in the case of longitudinal data in order to identify treatment responders and nonresponders. Our method combines a linear mixed model with two slopes and a classical clustering algorithm. The mixed model generates random effects associated with treatment response, specific to each patient. The clustering algorithm is used to define subgroups according to the value of the random effects. Our method is robust in case of small samples. Finding subgroups of responders may help to define predictive markers of treatment response which will be used to give the most appropriate treatment for each patient. We discussed integration of (i) the predictive markers in study design of future clinical trials, assessing their impact on the power of the study; and (ii) the prognostic markers of disease progression by studying the COMT polymorphism as a prognostic marker of cognitive decline in Huntington's disease. Finally, we evaluated the learning effect of neuropsychological tasks measuring cognitive abilities, and showed how a double baseline in a clinical trial could take it into account when the primary outcome is the cognitive decline.
|
357 |
Utilisation des modèles de co-clustering pour l'analyse exploratoire des données / No English title availableGuigourès, Romain 04 December 2013 (has links)
Le co-clustering est une technique de classification consistant à réaliser une partition simultanée des lignes et des colonnes d’une matrice de données. Parmi les approches existantes, MODL permet de traiter des données volumineuses et de réaliser une partition de plusieurs variables, continues ou nominales. Nous utilisons cette approche comme référence dans l’ensemble des travaux de la thèse et montrons la diversité des problèmes de data mining pouvant être traités, comme le partitionnement de graphes, de graphes temporels ou encore le clustering de courbes. L’approche MODL permet d’obtenir des résultats fins sur des données volumineuses, ce qui les rend difficilement interprétables. Des outils d’analyse exploratoire sont alors nécessaires pour les exploiter. Afin de guider l'utilisateur dans l'interprétation de tels résultats, nous définissons plusieurs outils consistant à simplifier des résultats fins afin d’en avoir une interprétation globale, à détecter les clusters remarquables, à déterminer les valeurs représentatives de leurs clusters et enfin à visualiser les résultats. Les comportements asymptotiques de ces outils d’analyse exploratoire sont étudiés afin de faire le lien avec les approches existantes.Enfin une application sur des comptes-rendus d’appels de l’opérateur Orange, collectés en Côte d’Ivoire, montre l’intérêt de l’approche et des outils d’analyse exploratoire dans un contexte industriel. / Co-clustering is a clustering technique aiming at simultaneously partitioning the rows and the columns of a data matrix. Among the existing approaches, MODL is suitable for processing huge data sets with several continuous or categorical variables. We use it as the baseline approach in this thesis. We discuss the reliability of applying such an approach on data mining problems like graphs partitioning, temporal graphs segmentation or curve clustering.MODL tracks very fine patterns in huge data sets, that makes the results difficult to study. That is why, exploratory analysis tools must be defined in order to explore them. In order to help the user in interpreting the results, we define exploratory analysis tools aiming at simplifying the results in order to make possible an overall interpretation, tracking the most interesting patterns, determining the most representative values of the clusters and visualizing the results. We investigate the asymptotic behavior of these exploratory analysis tools in order to make the connection with the existing approaches.Finally, we highlight the value of MODL and the exploratory analysis tools owing to an application on call detailed records from the telecom operator Orange, collected in Ivory Coast.
|
358 |
Vocation Clustering for Heavy-Duty VehiclesDaniel Patrick Kobold Jr (9719936) 07 January 2021 (has links)
<p>The identification of the vocation of an unknown heavy-duty
vehicle is valuable to parts manufacturers who may not have otherwise access to
this information on a consistent basis. This study proposes a methodology for
vocation identification that is based on clustering techniques. Two clustering algorithms are considered: K-Means
and Expectation Maximization. These algorithms are used to first construct the
operating profile of each vocation from a set of vehicles with known vocations.
The vocation of an unknown vehicle is then determined using different
assignment methods.</p>
<p> </p>
<p>These methods fall under two main categories: one-versus-all
and one-versus-one. The one-versus-all approach compares an unknown vehicle to
all potential vocations. The one-versus-one approach compares the unknown
vehicle to two vocations at a time in a tournament fashion. Two types of tournaments
are investigated: round-robin and bracket. The accuracy and efficiency of each
of the methods is evaluated using the NREL FleetDNA dataset.</p>
<p> </p>
<p>The study revealed that some of the vocations may have
unique operating profiles and are therefore easily distinguishable from others.
Other vocations, however, can have confounding profiles. This indicates that
different vocations may benefit from profiles with varying number of clusters. Determining
the optimal number of clusters for each vocation can not only improve the
assignment accuracy, but also enhance the computational efficiency of the
application. The optimal number of clusters for each vocation is determined
using both static and dynamic techniques. Static approaches refer to methods
that are completed prior to training and may require multiple iterations. Dynamic
techniques involve clusters being split or removed during training. The results
show that the accuracy of dynamic techniques is comparable to that of static
approaches while benefiting from a reduced computational time.</p>
|
359 |
Sparse subspace clustering-based motion segmentation with complete occlusion handlingMattheus, Jana January 2021 (has links)
Motion segmentation is part of the computer vision field and aims to find the moving parts in a video sequence. It is used in applications such as autonomous driving, surveillance, robotics, human motion analysis, and video indexing. Since there are so many applications, motion segmentation is ill-defined and the research field is vast. Despite the advances in the research over the years, the existing methods are still far behind human capabilities. Problems such as changes in illumination, camera motion, noise, mixtures of motion, missing data, and occlusion remain challenges.
Feature-based approaches have grown in popularity over the years, especially manifold clustering methods due to their strong mathematical foundation. Methods exploiting sparse and low-rank representations are often used since the dimensionality of the data is reduced while useful information regarding the motion segments is extracted. However, these methods are unable to effectively handle large and complete occlusions as well as missing data since they tend to fail when the amount of missing data becomes too large. An algorithm based on Sparse Subspace Clustering (SSC) has been proposed to address the issue of occlusions and missing data so that SSC can handle these cases with high accuracy. A frame-to-frame analysis was adopted as a pre-processing step to identify motion segments between consecutive frames, called inter-frame motion segments. The pre-processing step is called Multiple Split-And-Merge (MSAM), which is based on the classic top-down split-and-merge algorithm. Only points present in both frame pairs are segmented. This means that a point undergoing an occlusion is only assigned to a motion class when it has been visible for two consecutive frames after re-entering the camera view. Once all the inter-frame segments have been extracted, the results are combined in a single matrix and used as the input for the classic SSC algorithm. Therefore, SSC segments inter-frame motion segments rather than point trajectories. The resulting algorithm is referred to as MSAM-SSC.
MSAM-SSC outperformed some of the most popular manifold clustering methods on the Hopkins155 and KT3DMoSeg datasets. It was also able to handle complete occlusions and 50% missing data sequences, as well as outliers. The algorithm can handle mixtures of motions and different numbers of motions. However, it was found that MSAM-SSC is more suited for traffic and articulate motion scenes which are often used in applications such as robotics, surveillance, and autonomous driving. For future work, the algorithm can be optimised to reduce the execution time so that it can be used for real-time applications. Additionally, the number of moving objects in the scene can be estimated to obtain a method that does not rely on prior knowledge. / Dissertation (MEng (Computer Engineering))--University of Pretoria, 2021. / CSIR / Electrical, Electronic and Computer Engineering / MEng (Computer Engineering) / Unrestricted
|
360 |
Segmentace stránky ve webovém prohlížeči / Page Segmentation in a Web BrowserZubrik, Tomáš January 2021 (has links)
This thesis deals with the web page segmentation in a web browser. The implementation of Box Clustering Segmentation (BCS) method in JavaScript using an automated browser was created. The actual implementation consists of two main steps, which are the box extraction (leaf DOM nodes) from the browser context and their subsequent clustering based on the similarity model defined in BCS. Main result of this thesis is a functional implementation of BCS method usable for web page segmentation. The evaluation of the functionality and accuracy of the implementation is based on a comparison with a reference implementation created in Java.
|
Page generated in 0.1027 seconds