Global ETD Search

61	Selecionando candidatos a descritores para agrupamentos hierárquicos de documentos utilizando regras de associação / Selecting candidate labels for hierarchical document clusters using association rules Santos, Fabiano Fernandes dos 17 September 2010 (has links) Uma forma de extrair e organizar o conhecimento, que tem recebido muita atenção nos últimos anos, é por meio de uma representação estrutural dividida por tópicos hierarquicamente relacionados. Uma vez construída a estrutura hierárquica, é necessário encontrar descritores para cada um dos grupos obtidos pois a interpretação destes grupos é uma tarefa complexa para o usuário, já que normalmente os algoritmos não apresentam descrições conceituais simples. Os métodos encontrados na literatura consideram cada documento como uma bag-of-words e não exploram explicitamente o relacionamento existente entre os termos dos documento do grupo. No entanto, essas relações podem trazer informações importantes para a decisão dos termos que devem ser escolhidos como descritores dos nós, e poderiam ser representadas por regras de associação. Assim, o objetivo deste trabalho é avaliar a utilização de regras de associação para apoiar a identificação de descritores para agrupamentos hierárquicos. Para isto, foi proposto o método SeCLAR (Selecting Candidate Labels using Association Rules), que explora o uso de regras de associação para a seleção de descritores para agrupamentos hierárquicos de documentos. Este método gera regras de associação baseadas em transações construídas à partir de cada documento da coleção, e utiliza a informação de relacionamento existente entre os grupos do agrupamento hierárquico para selecionar candidatos a descritores. 
Os resultados da avaliação experimental indicam que é possível obter uma melhora significativa com relação a precisão e a cobertura dos métodos tradicionais / One way to organize knowledge that has received much attention in recent years is to create a structural representation divided into hierarchically related topics. Once this structure is built, it is necessary to find labels for each of the obtained clusters, since most algorithms do not produce simple descriptions and interpreting these clusters is a difficult task for users. Related work treats each document as a bag-of-words and does not explicitly explore the relationships between the terms of the documents. However, these relationships can provide important information for deciding which terms should be chosen as descriptors of the nodes, and could be represented by association rules. This work aims to evaluate the use of association rules to support the identification of labels for hierarchical document clusters. To this end, it presents the SeCLAR (Selecting Candidate Labels using Association Rules) method, which explores the use of association rules for the selection of good candidate labels for hierarchical clusters of documents. This method generates association rules based on transactions built from each document in the collection, and uses the relationship information between the nodes of the hierarchical clustering to select candidate labels. The experimental results show that it is possible to obtain a significant improvement in precision and recall over traditional methods. Agrupamento hierárquico de documentos Association rules Hierarchical document clustering Label hierarchical clustering Mineração de texto Regras de associação Text mining
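The abstract above describes building one transaction per document and mining association rules between terms. A minimal sketch of that idea follows, restricted to rules between pairs of terms. It is not the SeCLAR method itself: the function name `mine_pair_rules`, the thresholds and the toy documents are assumptions made for illustration, and the step that uses the cluster hierarchy to pick labels is not shown.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) rules over term pairs.

    Each transaction is the list of terms of one document (hypothetical input format).
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for terms in transactions:
        unique = sorted(set(terms))          # a term counts once per document
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # fraction of documents containing both terms
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):        # try both rule directions
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Toy collection: each inner list stands for the terms of one document.
docs = [
    ["text", "mining", "cluster"],
    ["text", "mining", "label"],
    ["text", "cluster"],
    ["image", "cluster"],
]
rules = mine_pair_rules(docs, min_support=0.5, min_confidence=0.8)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}  support={support:.2f}  confidence={confidence:.2f}")
```

On this toy collection only the rule "mining -> text" passes both thresholds, since every document that contains "mining" also contains "text".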
62	Word Clustering in an Interactive Text Analysis Tool / Klustring av ord i ett interaktivt textanalysverktyg Gränsbo, Gustav January 2019 (has links) A central operation of users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to automatically arrange the words in groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters tailored to be small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best on the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out similarities between words had a much larger impact on behaviour. word clustering word embedding distributional semantics hierarchical clustering text analytics language technology natural language processing gavagai
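The abstract above reports that capping cluster size and filtering out weak similarities mattered more than the choice of embedding or graph clustering algorithm. The sketch below illustrates those two controls only. It substitutes a plain greedy average-linkage agglomeration for the thesis's hierarchical graph clustering, and the function names, thresholds and two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, min_similarity=0.7, max_size=3):
    """Greedy average-linkage agglomeration with a similarity floor and a size cap."""
    clusters = [[w] for w in sorted(vectors)]

    def avg_sim(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > max_size:
                    continue                      # merge would exceed the size cap
                s = avg_sim(clusters[i], clusters[j])
                if s >= min_similarity and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:                          # no admissible merge is left
            return sorted(sorted(c) for c in clusters)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


# Hypothetical 2-D "embeddings"; real word embeddings have hundreds of dimensions.
toy = {
    "good": (1.0, 0.0),
    "great": (1.0, 0.1),
    "nice": (1.0, 0.5),
    "bad": (0.0, 1.0),
    "awful": (0.1, 1.0),
}
print(cluster_words(toy, min_similarity=0.7, max_size=2))
print(cluster_words(toy, min_similarity=0.7, max_size=3))
```

With `max_size=2` the word "nice" stays on its own because joining "good" and "great" would exceed the cap. With `max_size=3` it joins them.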
63	Essai sur la nature des travailleurs indépendants français : une approche socioéconomique / Essay on the nature of the French self-employed : a socioeconomical approach Rapelli, Stéphane 14 January 2011 (has links) L'objectif de cette thèse est de proposer une méthode de repérage empirique robuste du travailleur indépendant français. En effet, l'absence de norme homogène atténue la portée des travaux économétriques et statistiques. Dans un premier chapitre, les fondements de l'indépendance professionnelle sont mis en avant par une approche historique. Le deuxième chapitre permet de formuler des hypothèses typologiques consécutivement à l'examen des normes juridiques et du corpus empirique. Un idéaltype empirique est proposé dans le troisième chapitre. Il est formulé en opposant les hypothèses à l'analyse des résultats de classifications ascendantes hiérarchiques effectuées sur différents échantillons de travailleurs. La norme ainsi proposée permet de repérer objectivement les indépendants français au regard de critères inhérents au métier effectivement exercé, aux statuts entrepreneuriaux, à la taille de l'entreprise et au secteur d'activité. Cet idéaltype empirique permet un repérage robuste des indépendants au sein de la population des travailleurs non-salariés. / The objective of this thesis is to propose a robust empirical method for identifying the French self-employed. Indeed, the absence of a homogeneous standard limits the reach of econometric and statistical work. In the first chapter, the foundations of professional independence are studied through a historical approach. The second chapter formulates typological hypotheses following an examination of the legal rules and the empirical corpus. An empirical ideal type is proposed in the third chapter. It is formulated by setting the hypotheses against the analysis of the results of agglomerative hierarchical clusterings performed on various samples of workers. 
The standard proposed here makes it possible to identify the French self-employed objectively, using criteria inherent to the occupation actually exercised, the entrepreneurial status, the size of the enterprise and the business sector. This empirical ideal type allows a robust identification of self-employed workers within the population of non-salaried workers. Travailleur indépendant Non-salariat Classification ascendante hiérarchique Professions indépendantes Normes juridiques Normes statistiques Self-employed Non-salaried work Hierarchical clustering Independent occupations Legal rules Statistical standards
64	Image Characterization by Morphological Hierarchical Representations / Caractérisation d'images par des représentations morphologiques hiérarchiques Fehri, Amin 25 May 2018 (has links) Cette thèse porte sur l'extraction de descripteurs hiérarchiques et multi-échelles d'images, en vue de leur interprétation, caractérisation et segmentation. Elle se décompose en deux parties. La première partie expose des éléments théoriques et méthodologiques sur l'obtention de classifications hiérarchiques des nœuds d'un graphe valué aux arêtes. Ces méthodes sont ensuite appliquées à des graphes représentant des images pour obtenir différentes méthodes de segmentation hiérarchique d'images. De plus, nous introduisons différentes façons de combiner des segmentations hiérarchiques. Nous proposons enfin une méthodologie pour structurer et étudier l'espace des hiérarchies que nous avons construites en utilisant la distance de Gromov-Hausdorff entre elles. La seconde partie explore plusieurs applications de ces descriptions hiérarchiques d'images. Nous exposons une méthode pour apprendre à extraire de ces hiérarchies une bonne segmentation de façon automatique, étant donnés un type d'images et un score de bonne segmentation. Nous proposons également des descripteurs d'images obtenus par mesure des distances inter-hiérarchies, et exposons leur efficacité sur des données réelles et simulées. Enfin, nous étendons les potentielles applications de ces hiérarchies en introduisant une technique permettant de prendre en compte toute information spatiale a priori durant leur construction. / This thesis deals with the extraction of hierarchical and multiscale descriptors from images, in order to interpret, characterize and segment them. It breaks down into two parts. The first part outlines a theoretical and methodological approach for obtaining hierarchical clusterings of the nodes of an edge-weighted graph. In addition, we introduce different approaches to combine hierarchical segmentations. 
These methods are then applied to graphs representing images to derive different hierarchical segmentation techniques. Finally, we propose a methodology for structuring and studying the space of hierarchies by using the Gromov-Hausdorff distance as a metric. The second part explores several applications of these hierarchical descriptions of images. We present a method for learning to automatically extract a good segmentation from these hierarchies, given a type of image and a segmentation quality score. We also propose image descriptors obtained by measuring inter-hierarchy distances, and demonstrate their effectiveness on real and simulated data. Finally, we extend the potential applications of these hierarchies by introducing a technique that takes any prior spatial information into account during their construction. Morphologie mathématique Apprentissage statistique Théorie des graphes Segmentation hiérarchique Classification hiérarchique Traitement d'images Segmentation Mathematical morphology Machine learning Graph theory Hierarchical segmentation Hierarchical clustering Image processing Segmentation 511
65	Toward Scalable Hierarchical Clustering and Co-clustering Methods : application to the Cluster Hypothesis in Information Retrieval / Méthodes de regroupement hiérarchique agglomératif et co-clustering, leurs applications aux tests d’hypothèse de cluster et implémentations distribuées Wang, Xinyu 29 November 2017 (has links) Comme une méthode d’apprentissage automatique non supervisé, la classification automatique est largement appliquée dans des tâches diverses. Différentes méthodes de la classification ont leurs caractéristiques uniques. La classification hiérarchique, par exemple, est capable de produire une structure binaire en forme d’arbre, appelée dendrogramme, qui illustre explicitement les interconnexions entre les instances de données. Le co-clustering, d’autre part, génère des co-clusters, contenant chacun un sous-ensemble d’instances de données et un sous-ensemble d’attributs de données. L’application de la classification sur les données textuelles permet d’organiser les documents et de révéler les connexions parmi eux. Cette caractéristique est utile dans de nombreux cas, par exemple, dans les tâches de recherche d’informations basées sur la classification. À mesure que la taille des données disponibles augmente, la demande de puissance du calcul augmente. En réponse à cette demande, de nombreuses plates-formes du calcul distribué sont développées. Ces plates-formes utilisent les puissances du calcul collectives des machines, pour couper les données en morceaux, assigner des tâches du calcul et effectuer des calculs simultanément.Dans cette thèse, nous travaillons sur des données textuelles. Compte tenu d’un corpus de documents, nous adoptons l’hypothèse de «bag-of-words» et applique le modèle vectoriel. Tout d’abord, nous abordons les tâches de la classification en proposant deux méthodes, Sim_AHC et SHCoClust. 
Ils représentent respectivement un cadre des méthodes de la classification hiérarchique et une méthode du co-clustering hiérarchique, basé sur la proximité. Nous examinons leurs caractéristiques et performances du calcul, grâce de déductions mathématiques, de vérifications expérimentales et d’évaluations. Ensuite, nous appliquons ces méthodes pour tester l’hypothèse du cluster, qui est l’hypothèse fondamentale dans la recherche d’informations basée sur la classification. Dans de tels tests, nous utilisons la recherche du cluster optimale pour évaluer l’efficacité de recherche pour tout les méthodes hiérarchiques unifiées par Sim_AHC et par SHCoClust. Nous aussi examinons l’efficacité du calcul et comparons les résultats. Afin d’effectuer les méthodes proposées sur des ensembles de données plus vastes, nous sélectionnons la plate-forme d’Apache Spark et fournissons implémentations distribuées de Sim_AHC et de SHCoClust. Pour le Sim_AHC distribué, nous présentons la procédure du calcul, illustrons les difficultés rencontrées et fournissons des solutions possibles. Et pour SHCoClust, nous fournissons une implémentation distribuée de son noyau, l’intégration spectrale. Dans cette implémentation, nous utilisons plusieurs ensembles de données qui varient en taille pour examiner l’échelle du calcul sur un groupe de noeuds. / As a major type of unsupervised machine learning method, clustering has been widely applied in various tasks. Different clustering methods have different characteristics. Hierarchical clustering, for example, can output a binary tree-like structure, which explicitly illustrates the interconnections among data instances. Co-clustering, on the other hand, generates co-clusters, each containing a subset of data instances and a subset of data attributes. Applying clustering to textual data makes it possible to organize input documents and reveal connections among them. 
This characteristic is helpful in many cases, for example, in cluster-based Information Retrieval tasks. As the size of available data increases, the demand for computing power increases. In response to this demand, many distributed computing platforms have been developed. These platforms use the collective computing power of commodity machines to partition data, assign computing tasks and perform computation concurrently. In this thesis, we first address text clustering tasks by proposing two clustering methods, Sim_AHC and SHCoClust. They respectively represent a similarity-based hierarchical clustering and a similarity-based hierarchical co-clustering. We examine their properties and performance through mathematical deduction, experimental verification and evaluation. Then we apply these methods to test the cluster hypothesis, which is the fundamental assumption in cluster-based Information Retrieval. In such tests, we apply the optimal cluster search to evaluate the retrieval effectiveness of different clustering methods. We examine the computing efficiency and compare the results of the proposed tests. In order to perform clustering on larger datasets, we select the Apache Spark platform and provide distributed implementations of Sim_AHC and of SHCoClust. For distributed Sim_AHC, we present the designed computing procedure, illustrate the difficulties encountered and provide possible solutions. For SHCoClust, we provide a distributed implementation of its core, spectral embedding. In this implementation, we use several datasets that vary in size to examine scalability. Classification ascendante hiérarchique Co-clustering Recherche d’informations Hypothèse de cluster Calcul distribué Hierarchical clustering Co-clustering Information Retrieval Cluster hypothesis Distributed computing
66	Applications of MALDI-TOF/MS combined with molecular imaging for breast cancer diagnosis Chiang, Yi-Yan 26 July 2011 (has links) Breast cancer has become the most common cancer in women and the fourth leading cause of female cancer death. In this study, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) was combined with multivariate statistics to investigate breast cancer tissues and cell lines. Core needle biopsy and fine needle aspiration (FNA) are techniques widely applied in the diagnosis of breast cancer. We established an efficient protocol for analyzing breast tissue and FNA samples with MALDI-TOF/MS. With the help of statistical analysis software, we found lipid-derived ion signals that can be used to distinguish breast cancer tumor tissue from non-tumor tissue. This strategy can differentiate normal and tumor tissue and has potential for clinical diagnosis. The analysis of breast cancer tissue is challenging because of the complexity of the tissue sample. Direct tissue analysis by matrix-assisted laser desorption/ionization imaging mass spectrometry (MALDI-IMS) allows us to investigate molecular structures and their distribution while maintaining the integrity of the tissue and avoiding the loss of signal caused by extraction steps. By combining MALDI-IMS with statistical software, tissues can be analyzed and classified based on their molecular content, which helps to distinguish tumor regions from non-tumor regions of breast cancer tissue. Our results show differences in the distribution and content of lipids between tumor and non-tumor tissue, which can supplement current pathological analysis of tumor margins. MALDI-TOF/MS combined with multivariate statistics was also used to rapidly differentiate breast cancer cell lines with different estrogen receptor (ER) and human epidermal growth factor receptor 2 (HER2) status. 
A protocol for efficiently detecting peptides and proteins in breast cancer cells with MALDI-TOF/MS was established, and two multivariate statistical methods, principal component analysis (PCA) and hierarchical clustering analysis, were used to process the MALDI mass spectra obtained from six different breast cancer cell lines and one normal breast cell line. Based on the differences in their peptide and protein profiles, breast cancer cell lines with the same ER and HER2 status were grouped in nearby regions of the PCA score plot. The results of hierarchical cluster analysis also revealed high conformity between breast cancer cell protein profiles and the respective hormone receptor types. estrogen receptor human epidermal growth factor receptor 2 hierarchical clustering analysis principal component analysis multivariate statistics imaging mass spectrometry
67	Σχεδιασμός και ανάπτυξη αλγορίθμου συσταδοποίησης μεγάλης κλίμακας δεδομένων Γούλας, Χαράλαμπος January 2015 (has links) Υπό το φάσμα της νέας, ανερχόμενης κοινωνίας της πληροφορίας, η σύγκλιση των υπολογιστών με τις τηλεπικοινωνίες έχει οδηγήσει στην συνεχώς αυξανόμενη παραγωγή και αποθήκευση τεράστιου όγκου δεδομένων σχεδόν για οποιονδήποτε τομέα της ανθρώπινης ενασχόλησης. Αν, λοιπόν, τα δεδομένα αποτελούν τα καταγεγραμμένα γεγονότα της ανθρώπινης ενασχόλησης, οι πληροφορίες αποτελούν τους κανόνες, που τα διέπουν. Και η κοινωνία στηρίζεται και αναζητά διακαώς νέες πληροφορίες. Το μόνο που απομένει, είναι η ανακάλυψη τους. Ο τομέας, που ασχολείται με την συστηματική ανάλυση των δεδομένων με σκοπό την εξαγωγή χρήσιμης γνώσης ονομάζεται μηχανική μάθηση. Υπό αυτό, λοιπόν, το πρίσμα, η παρούσα διπλωματική πραγματεύεται την μηχανική μάθηση ως μια ελπίδα των επιστημόνων να αποσαφηνίσουν τις δομές που διέπουν τα δεδομένα και να ανακαλύψουν και να κατανοήσουν τους κανόνες, που “κινούν” τον φυσικό κόσμο. Αρχικά, πραγματοποιείται μια πρώτη περιγραφή της μηχανικής μάθησης ως ένα από τα βασικότερα δομικά στοιχεία της τεχνητής νοημοσύνης, παρουσιάζοντας ταυτόχρονα μια πληθώρα προβλημάτων, στα οποία μπορεί να βρει λύση, ενώ γίνεται και μια σύντομη ιστορική αναδρομή της πορείας και των κομβικών της σημείων. Ακολούθως, πραγματοποιείται μια όσο το δυνατόν πιο εμπεριστατωμένη περιγραφή, μέσω χρήσης εκτεταμένης βιβλιογραφίας, σχεδιαγραμμάτων και λειτουργικών παραδειγμάτων των βασικότερων κλάδων της, όπως είναι η επιβλεπόμενη μάθηση (δέντρα αποφάσεων, νευρωνικά δίκτυα), η μη-επιβλεπόμενη μάθηση (συσταδοποίηση δεδομένων), καθώς και πιο εξειδικευμένων μορφών της, όπως είναι η ημί-επιβλεπόμενη μηχανική μάθηση και οι γενετικοί αλγόριθμοι. Επιπρόσθετα, σχεδιάζεται και υλοποιείται ένας νέος πιθανοτικός αλγόριθμος συσταδοποίησης (clustering) δεδομένων, ο οποίος ουσιαστικά αποτελεί ένα υβρίδιο ενός ιεραρχικού αλγορίθμου ομαδοποίησης και ενός αλγορίθμου διαμέρισης. 
Ο αλγόριθμος δοκιμάστηκε σε ένα πλήθος διαφορετικών συνόλων, πετυχαίνοντας αρκετά ενθαρρυντικά αποτελέσματα, συγκριτικά με άλλους γνωστούς αλγορίθμους, όπως είναι ο k-means και ο single-linkage. Πιο συγκεκριμένα, ο αλγόριθμος κατασκευάζει συστάδες δεδομένων, με μεγαλύτερη ομοιογένεια κατά πλειοψηφία σε σχέση με τους παραπάνω, ενώ το σημαντικότερο πλεονέκτημά του είναι ότι δεν χρειάζεται κάποια αντίστοιχη παράμετρο k για να λειτουργήσει. Τέλος, γίνονται προτάσεις τόσο για περαιτέρω βελτίωση του παραπάνω αλγορίθμου, όσο και για την ανάπτυξη νέων τεχνικών και μεθόδων, εναρμονισμένων με τις σύγχρονες τάσεις της αγοράς και προσανατολισμένων προς τις απαιτητικές ανάγκες της νέας, αναδυόμενης κοινωνίας της πληροφορίας. / In the new and emerging information society, the convergence of computers and telecommunications has led to a continuously increasing production and storage of huge amounts of data in almost every field of human activity. If data are the recorded facts of human activity, then information is the set of rules that govern them. Society depends on, and earnestly seeks, new information. All that remains is its discovery. The field of computer science that deals with the systematic analysis of data in order to extract useful knowledge is called machine learning. In this light, this thesis discusses machine learning as a hope of scientists to elucidate the structures that govern the data and to discover and understand the rules that "move" the natural world. Firstly, a general description of machine learning as one of the main components of artificial intelligence is given, together with a variety of problems for which machine learning can find solutions and a brief historical overview of its progress and milestones. 
Secondly, a more detailed description of machine learning is presented, using extensive literature, diagrams and working examples of its major branches, such as supervised learning (decision trees, neural networks) and unsupervised learning (clustering algorithms), as well as more specialized forms, such as semi-supervised machine learning and genetic algorithms. In addition, a new probabilistic clustering algorithm is designed and implemented, which is a hybrid of a hierarchical clustering algorithm and a partitioning algorithm. The algorithm was tested on a number of different datasets, achieving quite encouraging results compared to other known algorithms, such as k-means and single-linkage. More specifically, in most cases the algorithm constructs clusters with greater homogeneity than the above algorithms, while its most important advantage is that it needs no corresponding parameter k to operate. Finally, suggestions are made both for further improving the above algorithm and for developing new techniques and methods in keeping with current market trends and oriented to the demanding needs of this new, emerging information society. Μηχανική μάθηση Δέντρα αποφάσεων Νευρωνικά δίκτυα Γενετικοί αλγόριθμοι Υβριδικοί αλγόριθμοι 006.31 Machine learning Hierarchical clustering Decision trees Neural networks Genetic algorithms Hybrid algorithms
68	Τεχνικές ταξινόμησης σεισμογραμμάτων Πίκουλης, Βασίλης 01 October 2008 (has links) Σεισμικά γεγονότα τα οποία προέρχονται από σεισμικές πηγές των οποίων η απόσταση μεταξύ τους είναι πολύ μικρότερη από την απόσταση μέχρι τον κοντινότερο σταθμό καταγραφής, είναι γνωστά στη βιβλιογραφία σαν όμοια σεισμικά γεγονότα και αποτελούν αντικείμενο έρευνας εδώ και μια εικοσαετία. Η διαδικασία επαναπροσδιορισμού των υποκεντρικών παραμέτρων ή επανεντοπισμού όμοιων σεισμικών γεγονότων οδηγεί σε εκτιμήσεις των παραμέτρων που είναι συνήθως μεταξύ μίας και δύο τάξεων μεγέθους μικρότερου σφάλματος από τις αντίστοιχες των συνηθισμένων διαδικασιών εντοπισμού και επομένως, μπορεί εν δυνάμει να παράξει μια λεπτομερέστερη εικόνα της σεισμικότητας μιας περιοχής, από την οποία μπορεί στη συνέχεια να προκύψει η ακριβής χαρτογράφηση των ενεργών ρηγμάτων της. Πρόκειται για μια σύνθετη διαδικασία που μπορεί να αναλυθεί στα παρακάτω τρία βασικά βήματα: 1. Αναγνώριση ομάδων όμοιων σεισμικών γεγονότων. 2. Υπολογισμός διαφορών χρόνων άφιξης μεταξύ όμοιων σεισμικών γεγονότων. 3. Επίλυση προβλήματος αντιστροφής. Το πρώτο από τα παραπάνω βήματα είναι η αναγνώριση των λεγόμενων σεισμικών οικογενειών που υπάρχουν στον διαθέσιμο κατάλογο και έχει ξεχωριστή σημασία για την ολική επιτυχία της διαδικασίας. Μόνο εάν εξασφαλιστεί η ορθότητα της επίλυσης αυτού του προβλήματος τίθενται σε ισχύ οι προϋποθέσεις για την εφαρμογή της διαδικασίας και άρα έχει νόημα η γεωλογική ανάλυση που ακολουθεί. Είναι επίσης ένα πρόβλημα που απαντάται και σε άλλες γεωλογικές εφαρμογές, όπως είναι για παράδειγμα ο αυτόματος εντοπισμός του ρήγματος γένεσης ενός άγνωστου σεισμικού γεγονότος μέσω της σύγκρισής του με διαθέσιμες αντιπροσωπευτικές οικογένειες. Το πρόβλημα της αναγνώρισης είναι στην ουσία ένα πρόβλημα ταξινόμησης και ως εκ τούτου προϋποθέτει την επίλυση δύο σημαντικών επιμέρους υποπροβλημάτων. 
Συγκεκριμένα, αυτό της αντιστοίχισης των σεισμικών κυματομορφών (matching problem) και αυτό της κατηγοριοποίησής τους (clustering problem). Το πρώτο έχει να κάνει με τη σύγκριση όλων των δυνατών ζευγών σεισμογραμμάτων του καταλόγου ώστε να εντοπισθούν όλα τα όμοια ζεύγη, ενώ το δεύτερο αφορά την ομαδοποίηση των ομοίων σεισμογραμμάτων ώστε να προκύψουν οι σεισμικές οικογένειες. Στα πλαίσια αυτής της εργασίας, λαμβάνοντας υπόψη τις ιδιομορφίες που υπεισέρχονται στο παραπάνω πρόβλημα ταξινόμησης από τις ιδιαιτερότητες των σεισμογραμμάτων αλλά και την ιδιαίτερη φύση της εφαρμογής, προτείνουμε μια μέθοδο σύγκρισης που βασίζεται σε μια γενικευμένη μορφή του συντελεστή συσχέτισης και μια μέθοδο κατηγοριοποίησης βασισμένη σε γράφους, με στόχο την αποτελεσματική αλλά και αποδοτική επίλυσή του. / Seismic events that occur in a confined region, meaning that the distance separating the sources is very small compared to the distance between the sources and the nearest recording station, are known in the literature as similar seismic events and have been under study for the past two decades. The re-estimation of the hypocenter parameters, or relocation, of similar events gives an estimation error that is between one and two orders of magnitude lower than that produced by conventional location procedures. As a result, this approach can create a much more detailed image of the seismicity of the region under study, from which an exact mapping of the active faults of the region can be derived. The relocation procedure is a complex one, consisting of three basic steps: 1. Identification of groups of similar seismic events. 2. Estimation of the arrival time differences between events of the same group. 3. Solution of the inverse problem. 
The first of the above steps, namely the identification of the seismic families in the given catalog, plays an important role in the overall success of the procedure, since only a correct solution of this problem can ensure that the requirements for applying the procedure are met and, therefore, that the geological analysis based on its outcome is meaningful. The problem is also encountered in other geological applications, such as the automatic identification of the fault that generated an unknown event by comparison with available representative families. The problem of identifying the seismic families is a classification problem and, as such, requires the solution of two subproblems, namely the matching problem and the clustering problem. The first concerns the comparison of all possible event pairs in the catalog in order to locate all similar pairs, while the second concerns the grouping of the similar seismograms into seismic families. In this work, taking into consideration the particularities that the special nature of the seismograms and the specific requirements of the application introduce into the classification problem described above, we propose a comparison method based on a generalized form of the correlation coefficient and a graph-based clustering technique, as an effective and efficient solution of the problem at hand. Ταξινόμηση δεδομένων Αντιστοίχηση σήματος Συσχέτιση Κατηγοριοποίηση Θεωρία Γράφων 551.220 287 Similar seismic events Data classification Signal matching Correlation Clustering Hierarchical clustering Graph Theory
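The two subproblems named in the abstract above, matching and clustering, can be sketched as follows. This is a simplified stand-in, not the thesis's method: matching uses the ordinary normalized cross-correlation maximized over a small range of lags in place of the generalized correlation coefficient, and clustering takes connected components of the similarity graph. The function names, the threshold and the toy traces are assumptions made for illustration.

```python
import math


def max_normalized_correlation(x, y, max_lag=5):
    """Largest normalized cross-correlation of x and y over integer lags."""
    def centered(s):
        mean = sum(s) / len(s)
        return [v - mean for v in s]

    x, y = centered(x), centered(y)
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0:
        return 0.0
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, xv in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):          # only the overlapping samples contribute
                total += xv * y[j]
        best = max(best, total / norm)
    return best


def seismic_families(traces, threshold=0.8, max_lag=5):
    """Matching step: link similar pairs. Clustering step: connected components."""
    names = sorted(traces)
    neighbours = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if max_normalized_correlation(traces[a], traces[b], max_lag) >= threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)

    families, seen = [], set()
    for start in names:
        if start in seen:
            continue
        stack, family = [start], []
        seen.add(start)
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            family.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        families.append(sorted(family))
    return families


# Toy traces: ev2 is ev1 delayed by two samples, ev3 is an unrelated oscillation.
traces = {
    "ev1": [0, 0, 1, 3, 1, 0, 0, 0, 0, 0],
    "ev2": [0, 0, 0, 0, 1, 3, 1, 0, 0, 0],
    "ev3": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
}
families = seismic_families(traces, threshold=0.8, max_lag=5)
print(families)
```

Connected components are the simplest graph-based grouping. They chain events together through a single strong link, so a stricter threshold or a denser-subgraph criterion may be preferable on real catalogs.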
69	Pattern Recognition in Single Molecule Force Spectroscopy Data Paulin, Hilary 05 September 2013 (has links) We have developed an analytical technique for single molecule force spectroscopy (SMFS) data that avoids filtering prior to analysis and performs pattern recognition to identify distinct SMFS events. The technique characterizes the signal similarity between all curves in a data set and generates a hierarchical clustering tree, from which clusters can be identified, aligned, and examined to identify key patterns. This procedure was applied to alpha-lactalbumin (aLA) on polystyrene substrates with flat and nanoscale curvature, and bacteriorhodopsin (bR) adsorbed on mica substrates. Cluster patterns identified for the aLA data sets were associated with different higher-order protein-protein interactions. Changes in the frequency of the patterns showed an increase in the monomeric signal from flat to curved substrates. Analysis of the bR data showed a high level of multiple protein SMFS events and allowed for the identification of a set of characteristic three-peak unfolding events. biophysics single molecule biophysics single molecule force spectroscopy alpha-lactalbumin bacteriorhodopsin pattern recognition hierarchical clustering clustering force spectroscopy protein adsorption protein structure surface curvature atomic force microscopy
70	Machine Learning Approaches to Refining Post-translational Modification Predictions and Protein Identifications from Tandem Mass Spectrometry Chung, Clement 11 December 2012 (has links) Tandem mass spectrometry (MS/MS) is the dominant approach for large-scale peptide sequencing in high-throughput proteomic profiling studies. The computational analysis of MS/MS spectra involves the identification of peptides from experimental spectra, especially those with post-translational modifications (PTMs), as well as the inference of protein composition based on the putative identified peptides. In this thesis, we tackled two major challenges associated with MS/MS analysis: 1) the refinement of PTM predictions from MS/MS spectra and 2) the inference of protein composition based on peptide predictions. We proposed two PTM prediction refinement algorithms, PTMClust and its Bayesian nonparametric extension iPTMClust, and a protein identification algorithm, pro-HAP, that is based on a novel two-layer hierarchical clustering approach that leverages prior knowledge about protein function. Individually, we show that our two PTM refinement algorithms outperform state-of-the-art algorithms and that our protein identification algorithm performs on par with the state of the art. Collectively, as a demonstration of our end-to-end MS/MS computational analysis on a human chromatin protein complex study, we show that our analysis pipeline can find high-confidence putative novel protein complex members. Moreover, it can provide valuable insights into the formation and regulation of protein complexes by detailing the specificity of different PTMs for the members in each complex. Machine Learning Unsupervised Learning Clustering Mass Spectrometry Protein Identification Nonparametric Bayesian method Hierarchical clustering 0800 0984 0715
