Global ETD Search

1	Learning a Multiview Weighted Majority Vote Classifier : Using PAC-Bayesian Theory and Boosting / Apprentissage de vote de majorité pour la classification multivue : Utilisation de la théorie PAC-Bayésienne et du boosting Goyal, Anil 23 October 2018 Avec la génération massive de données, nous disposons de plus en plus de données issues de différentes sources d’informations ayant des propriétés hétérogènes. Il est donc important de prendre en compte ces représentations ou vues des données. Ce problème d'apprentissage automatique est appelé apprentissage multivue. Il est utile dans de nombreux domaines d’applications : par exemple, en imagerie médicale, nous pouvons représenter le cerveau humain via des IRM, t-fMRI, EEG, etc. Dans cette thèse, nous nous concentrons sur l’apprentissage multivue supervisé, où l’apprentissage multivue est une combinaison de différents modèles de classification ou de vues. Par conséquent, selon notre point de vue, il est intéressant d’aborder la question de l’apprentissage à vues multiples dans le cadre PAC-Bayésien. C’est un outil issu de la théorie de l’apprentissage statistique étudiant les modèles s’exprimant comme des votes de majorité. Un de ses avantages est qu’il permet de prendre en considération le compromis entre précision et diversité des votants, au cœur des problématiques liées à l’apprentissage multivue. La première contribution de cette thèse étend la théorie PAC-Bayésienne classique (avec une seule vue) à l’apprentissage multivue (avec au moins deux vues). Pour ce faire, nous définissons une hiérarchie de votants à deux niveaux : les classifieurs spécifiques à la vue et les vues elles-mêmes. Sur la base de cette stratégie, nous avons dérivé des bornes en généralisation PAC-Bayésiennes (probabilistes et non-probabilistes) pour l’apprentissage multivue. D'un point de vue pratique, nous avons conçu deux algorithmes d'apprentissage multivue basés sur notre stratégie PAC-Bayésienne à deux niveaux. 
Le premier algorithme appelé PB-MVBoost est un algorithme itératif qui apprend les poids sur les vues en contrôlant le compromis entre la précision et la diversité des vues. Le second est une approche de fusion tardive où les prédictions des classifieurs spécifiques aux vues sont combinées via l’algorithme PAC-Bayésien CqBoost proposé par Roy et al. Enfin, nous montrons que la minimisation des erreurs pour le vote de majorité multivue est équivalente à la minimisation de divergences de Bregman. De ce constat, nous proposons un algorithme appelé MωMvC2 pour apprendre un vote de majorité multivue. / With the massive generation of data, more and more data are collected from different information sources with heterogeneous properties, so it is important to take these representations, or views, of the data into account. This machine learning problem is referred to as multiview learning. It has many applications: in medical imaging, for example, the human brain can be represented with different sets of features such as MRI, t-fMRI, EEG, etc. In this thesis, we focus on supervised multiview learning, where we see multiview learning as a combination of different view-specific classifiers or views. From this point of view, it is natural to tackle multiview learning through the PAC-Bayesian framework, a tool derived from statistical learning theory for studying models expressed as majority votes. One advantage of PAC-Bayesian theory is that it directly captures the trade-off between the accuracy and the diversity of the voters, which is central to multiview learning. The first contribution of this thesis extends classical PAC-Bayesian theory (with a single view) to multiview learning (with at least two views). To do this, we considered a two-level hierarchy of distributions over the view-specific voters and the views. 
Based on this strategy, we derived PAC-Bayesian generalization bounds (both probabilistic and expected risk bounds) for multiview learning. From a practical point of view, we designed two multiview learning algorithms based on our two-level PAC-Bayesian strategy. The first, called PB-MVBoost, is a one-step boosting-based multiview learning algorithm. It iteratively learns the weights over the views by optimizing the multiview C-Bound, which controls the trade-off between the accuracy and the diversity of the views. The second is a late fusion approach in which we combine the predictions of view-specific classifiers using the PAC-Bayesian algorithm CqBoost proposed by Roy et al. Finally, we show that minimizing the classification error of the multiview weighted majority vote is equivalent to minimizing Bregman divergences. This allowed us to derive a parallel-update optimization algorithm (referred to as MωMvC2) to learn our multiview weighted majority vote.
Apprentissage multivue, Théorie PAC-Bayésienne, Votes de majorité, Multiview Learning, PAC-Bayesian Theory, Boosting, Majority Vote
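The two-level hierarchy described in this abstract (weights over the voters within each view, then weights over the views) can be illustrated with a minimal sketch. The function name, the argument layout and the tie-breaking rule are assumptions made for illustration; this is not the thesis's PB-MVBoost or MωMvC2 algorithm, only the shape of the vote those algorithms learn weights for.

```python
def multiview_vote(view_outputs, voter_weights, view_weights):
    """Two-level weighted majority vote on a single example.

    view_outputs[v][k]  : output in {-1, +1} of voter k from view v
    voter_weights[v][k] : weight of voter k inside view v (sums to 1 per view)
    view_weights[v]     : weight of view v (sums to 1 over views)
    All names are hypothetical; ties are broken toward +1 by assumption.
    """
    score = 0.0
    for outputs, q, rho in zip(view_outputs, voter_weights, view_weights):
        # First level: weighted vote of the voters inside one view.
        view_score = sum(w * h for w, h in zip(q, outputs))
        # Second level: weight that view's vote by the view weight.
        score += rho * view_score
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    outputs = [[1, -1], [-1, -1]]          # two views, two voters each
    q = [[0.9, 0.1], [0.5, 0.5]]           # voter weights per view
    print(multiview_vote(outputs, q, [0.7, 0.3]))
    print(multiview_vote(outputs, q, [0.3, 0.7]))
```

Shifting weight from the first view to the second flips the decision, which is the kind of trade-off the view weights control.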
2	Trading Strategy Mining with Gene Expression Programming Huang, Chang-Hao 12 September 2012 In this thesis, we apply gene expression programming (GEP) to train profitable trading strategies. We propose a model that uses several historical periods highly related to the current template period; the best trading strategies of those historical periods generate the trading signals. To keep the model stable, we propose a trading decision mechanism based on a simple majority vote. The Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) is selected as the investment target, and the trading period runs from 2000/9/14 to 2012/1/17, approximately twelve years. In our experiments, the training periods are 60, 90, 120, 180, and 270 trading days long. We observe that models with a higher voting threshold can usually make profitable trading decisions. The best cumulative return, 236.25%, and the best annualized cumulative return, 10.63%, occur when the 180-day training models are paired with an available threshold of 0.21 and a voting threshold of 0.88; both exceed the buy-and-hold strategy's cumulative return of 0.96% and annualized cumulative return of 0.08%.
simple majority vote, feature set, strategy pool, gene expression programming
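The relation between the cumulative and annualized returns quoted in this abstract can be checked with a short calculation. The sketch below assumes geometric compounding over exactly 12 years (the abstract says "approximately twelve years"); the function name is hypothetical and the thesis may define annualization differently.

```python
def annualized_return(cumulative_return, years):
    """Geometric annualization: (1 + R) ** (1 / years) - 1.

    cumulative_return is a fraction (2.3625 means 236.25%).
    The 12-year horizon used below is an assumption, not a stated figure.
    """
    return (1.0 + cumulative_return) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    for label, cumulative in [("GEP model", 2.3625), ("buy-and-hold", 0.0096)]:
        rate = annualized_return(cumulative, 12)
        print(f"{label}: {rate * 100:.2f}% per year")
```

Under this assumption the reported annualized figures appear consistent with the reported cumulative ones.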
3	Apprentissage de vote de majorité pour la classification supervisée et l'adaptation de domaine : Approches PAC Bayésiennes et combinaison de similarités Morvant, Emilie 18 September 2013 De nombreuses applications font appel à des méthodes d'apprentissage capables de considérer différentes sources d'information (e.g. images, son, texte) en combinant plusieurs modèles ou descriptions. Cette thèse propose des contributions théoriquement fondées permettant de répondre à deux problématiques importantes pour ces méthodes : (i) Comment intégrer de la connaissance a priori sur des informations ? (ii) Comment adapter un modèle sur des données ne suivant pas la distribution des données d'apprentissage ? Une 1ère série de résultats en classification supervisée s'intéresse à l'apprentissage de votes de majorité sur des classifieurs dans un contexte PAC-Bayésien prenant en compte un a priori sur ces classifieurs. Le 1er apport étend un algorithme de minimisation de l'erreur du vote en classification binaire en permettant l'utilisation d'a priori sous la forme de distributions alignées sur les votants. Notre 2ème contribution analyse théoriquement l'intérêt de la minimisation de la norme opérateur de la matrice de confusion de votes dans un contexte de données multiclasses. La 2nde série de résultats concerne l'adaptation de domaine (AD) en classification binaire : le 3ème apport combine des fonctions de similarité (epsilon,gamma,tau)-Bonnes pour inférer un espace rapprochant les distributions des données d'apprentissage et de test à l'aide de la minimisation d'une borne. Notre 4ème contribution propose une analyse PAC-Bayésienne de l'AD basée sur une divergence entre distributions. Nous en dérivons des garanties théoriques pour les votes de majorité et un algorithme adapté aux classifieurs linéaires minimisant cette borne. / Many applications make use of machine learning methods able to take into account different information sources (e.g. 
sounds, images, text) by combining different descriptors or models. This thesis proposes a series of theoretically founded contributions dealing with two main issues for such methods: (i) How can available a priori information be embedded? (ii) How can a model be adapted to new data following a distribution different from that of the learning data? This last issue is known as domain adaptation (DA). A first series of contributions studies the problem of learning a majority vote over a set of voters for supervised classification in the PAC-Bayesian context, allowing one to consider an a priori on the voters. Our first contribution extends an algorithm minimizing the error of the majority vote in binary classification by allowing the use of an a priori expressed as an aligned distribution. The second analyzes theoretically the interest of minimizing the operator norm of the confusion matrix of the votes in the multiclass setting. Our second series of contributions deals with DA for binary classification. The third contribution combines (epsilon,gamma,tau)-Good similarity functions to infer a new projection space that brings the learning and test distributions closer together by minimizing a DA bound. Finally, we propose a PAC-Bayesian analysis of DA based on a divergence between distributions. This analysis allows us to derive guarantees for learning majority votes in a DA context, and to design an algorithm specialized to linear classifiers that minimizes our bound.
Apprentissage Automatique, Vote de majorité, Théorie PAC-Bayésienne, Classification supervisée, Adaptation de domaine, Machine Learning, Majority vote, PAC-Bayesian theory, Supervised classification, Domain Adaptation, 004
4	Modelo do voto da maioria com distribuição mista de ruído LIMA JÚNIOR, Aranildo Rodrigues de 11 February 2011 In the majority-vote model with noise, defined on a network, a given site (spin) takes the state (sign) opposite to that of the majority of its neighboring spins with probability q and the same state as the majority with probability (1−q). The noise parameter q is homogeneous across all sites. In this work, we investigate a more general and realistic version of the majority-vote model, in which each site i has its own noise parameter q_i drawn from a mixed probability distribution, so that noise is heterogeneously distributed among the sites of the network. We consider the distribution P(q_i) = b δ(q_i) + (1−b) δ(q_i − q), where b is the fraction of sites without noise and q is taken from a Gaussian distribution. We perform Monte Carlo simulations on random graphs of different sizes and three values of the average connectivity, for several values of the parameter b. We calculate the magnetization, the susceptibility and Binder's fourth-order cumulant as functions of q. We find that the system exhibits an order-disorder phase transition at a critical value q_c of the noise parameter, which is an increasing function of the fraction of sites without noise. We use finite-size scaling theory to construct the phase diagram of the model and estimate the critical exponents β/ν, γ/ν and 1/ν. These exponents satisfy the hyperscaling relation with effective dimensionality equal to unity, for all values of the average connectivity and of b. 
Finally, we conclude that the majority-vote model with mixed distribution of noise on random graphs belongs to a different universality class from the model with homogeneous distribution of noise. / No modelo do voto da maioria com ruído, definido em uma rede, um dado sítio (spin) toma o estado contrário (sinal) à maioria dos seus vizinhos com probabilidade q e concorda com o estado da maioria dos seus vizinhos com probabilidade (1−q), onde q é o parâmetro de ruído homogêneo para todos os sítios. Nessa dissertação investigamos o modelo do voto da maioria com distribuição mista de ruídos, no qual cada sítio tem o parâmetro q satisfazendo uma distribuição mista de probabilidade de forma que há uma distribuição heterogênea com relação aos ruídos dos sítios da rede. Consideramos o caso de uma distribuição dada por P(q_i) = b δ(q_i) + (1−b) δ(q_i − q), onde b é a fração de sítios com ausência de ruído e q é dado por uma distribuição Gaussiana. Realizamos simulações de Monte Carlo para diversos valores do parâmetro b, em grafos aleatórios de diferentes tamanhos N e três valores da conectividade média. Calculamos a magnetização, a susceptibilidade e o cumulante de Binder como funções de q. Notamos que o sistema apresenta uma transição de fase do tipo ordem-desordem em um valor crítico do parâmetro de ruído q_c, o qual é uma função crescente da fração de sítios com ausência de ruído. A partir da teoria de escala de tamanho finito construímos o diagrama de fases do modelo no plano q_c versus b e estimamos os expoentes críticos β/ν, γ/ν e 1/ν. Esses expoentes satisfazem a relação de hiper-escala com a dimensionalidade efetiva do sistema D = 1 para todos os valores da conectividade média e b. Por fim concluímos que o modelo do voto da maioria com distribuição mista de ruído, pertence a uma classe de universalidade diferente do modelo com distribuição homogênea de ruído em grafos aleatórios. 
Modelo do voto da maioria, Método de Monte Carlo, Grafo aleatório, Probabilidade mista, Majority-vote model, Monte Carlo method, Random graphs, Mixed probability distribution
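The update rule in this abstract (a site opposes the majority of its neighbors with its own noise probability, where a fraction b of sites has zero noise) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, q is held fixed rather than drawn from a Gaussian, and ties between neighbors are broken at random, a convention the abstract does not specify.

```python
import random


def make_noise(n_sites, b, q, rng):
    """Per-site noise: 0 with probability b, otherwise q (q fixed by assumption)."""
    return [0.0 if rng.random() < b else q for _ in range(n_sites)]


def update_site(spins, neighbors, noise, i, rng):
    """Apply one majority-vote update to site i, in place."""
    total = sum(spins[j] for j in neighbors[i])
    if total == 0:
        # Tie among neighbors: random choice (assumed convention).
        spins[i] = rng.choice((-1, 1))
        return
    majority = 1 if total > 0 else -1
    # Oppose the majority with probability noise[i], follow it otherwise.
    spins[i] = -majority if rng.random() < noise[i] else majority


def magnetization(spins):
    """Absolute magnetization per site."""
    return abs(sum(spins)) / len(spins)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    # Ring with nearest and next-nearest neighbors; the thesis uses random graphs.
    neighbors = {i: [(i + d) % n for d in (-2, -1, 1, 2)] for i in range(n)}
    spins = [1] * n
    noise = make_noise(n, b=0.5, q=0.2, rng=rng)
    for _ in range(50 * n):
        update_site(spins, neighbors, noise, rng.randrange(n), rng)
    print(magnetization(spins))
```

Sweeping q for several values of b and system sizes, and recording the magnetization, is the kind of measurement from which a critical noise q_c would be estimated.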
5	Σχεδιασμός ανάπτυξη και εφαρμογή συστήματος υποστήριξης της διάγνωσης επιχρισμάτων θυρεοειδούς δεδομένων βιοψίας με λεπτή βελόνη FNA με χρήση εξελιγμένων μεθόδων εξόρυξης δεδομένων Ζούλιας, Εμμανουήλ 17 September 2012 (has links) Σκοπός της παρούσας διδακτορικής διατριβής είναι η ανάπτυξη ενός ολοκληρωμένου συστήματος υποστήριξης της διάγνωσης (Decision Support System - DSS) με χρήση μεθόδων εξόρυξης δεδομένων για την ταξινόμηση επιχρισμάτων βιοψίας με λεπτή βελόνα (Fine Needle Aspiration - FNA). Δύο κατηγορίες επιλέχθηκαν για τα δείγματα FNA: καλοήθεια και κακοήθεια. Το σύστημα αυτό αποτελείται από τις ακόλουθες βαθμίδες: 1) συλλογής δεδομένων, 2) επιλογής δεδομένων, 3) εύρεσης κατάλληλων χαρακτηριστικών, 4) εφαρμογής ταξινόμησης με χρήση μεθόδων εξόρυξης δεδομένων. Επίσης, βασικός στόχος της παρούσας διδακτορικής διατριβής ήταν η βελτίωση της ορθής ταξινόμησης των ύποπτων επιχρισμάτων (suspicious), για τα οποία είναι γνωστή η αδυναμία της μεθόδου FNA να τα ταξινομήσει. Το σύστημα εκπαιδεύτηκε και ελέγχθηκε σε σχέση με το δείγμα για το οποίο είχαμε ιστολογικές επιβεβαιώσεις (ground truth). Για περιπτώσεις οι οποίες χαρακτηρίστηκαν ως μη κακοήθεις από την FNA, και για τις οποίες δεν είχαμε ιστολογικές επιβεβαιώσεις, το δείγμα προέκυψε από την συνεκτίμηση και άλλων κλινικών, εργαστηριακών και απεικονιστικών εξετάσεων. Στα πλαίσια της παρούσας διδακτορικής διατριβής συλλέχθηκαν εξετάσεις FNA θυρεοειδούς από το Εργαστήριο Παθολογοανατομίας του Α’ Τμήματος Παθολογίας της Ιατρικής Σχολής του Πανεπιστημίου Αθηνών. Δεδομένου ότι το εν λόγω εργαστήριο λειτουργεί και σαν κέντρο αναφοράς, σημαντικός αριθμός των δειγμάτων εστάλησαν εκεί και από άλλα Εργαστήρια Παθολογοανατομίας για επανέλεγχο. Το αρχειακό υλικό ήταν πολύ καλά ταξινομημένο σε χρονολογική σειρά αλλά ήταν σε έντυπη μορφή. Αρχικά πραγματοποιήθηκε η ανάλυση απαιτήσεων για τη δομή και το σχεδιασμό της βάσης δεδομένων. 
Με βάση τα στοιχεία από την τεκμηριωμένη διάγνωση σχεδιάστηκε και αναπτύχθηκε προηγμένο σύστημα για την κωδικοποίηση και αρχικοποίηση των δεδομένων. Με τη βοήθεια του σχεδιασμού και ανάλυσης απαιτήσεων αναπτύχθηκε και υλοποιήθηκε η βάση δεδομένων στην οποία αποθηκεύτηκαν τα δεδομένα προς επεξεργασία. Παράλληλα, με το σχεδιασμό της βάσης έγινε και η προεργασία για το σχεδιασμό και την ανάλυση απαιτήσεων του γραφικού περιβάλλοντος εισαγωγής στοιχείων. Λαμβάνοντας υπόψη ότι το σύστημα θα μπορούσε να χρησιμοποιηθεί και πέρα από τα πλαίσια της παρούσας διδακτορικής διατριβής λήφθηκε μέριμνα ώστε να παρέχεται ένα φιλικό και ευέλικτο προς το χρήστη περιβάλλον. Σύμφωνα με τη μεθοδολογία προσέγγισης η οποία ακολουθήθηκε προηγήθηκε στατιστική ανάλυση των 9.102 συλλεχθέντων δειγμάτων FNA ως προς τα κυτταρολογικά χαρακτηριστικά τους και τις διαγνώσεις. Οι κυτταρολογικές διαγνώσεις των συγκεκριμένων δειγμάτων συσχετίστηκαν με τις ιστολογικές διαγνώσεις, στοχεύοντας στον υπολογισμό της πιθανής επίδρασης και συμβολής κάθε κυτταρολογικού χαρακτηριστικού σε μια ορθή ή ψευδή κυτταρολογική διάγνωση, έτσι ώστε να προσδιοριστούν οι πιθανές πηγές λανθασμένης διάγνωσης. Τα δείγματα τα οποία περιείχαν μόνο αίμα ή πολύ λίγα θυλακειώδη κύτταρα χωρίς κολλοειδές θεωρήθηκαν ανεπαρκή για τη διάγνωση. Οι βιοψίες εκτελέσθηκαν είτε στο Α’ τμήμα του Πανεπιστημίου Αθηνών (οι περισσότερες από τις περιπτώσεις με ψηλαφητούς όζους) είτε αλλού (κυρίως κάτω από την καθοδήγηση του κέντρου αναφοράς). Τα δείγματα επιστρωμένα σε πλακάκια, στάλθηκαν στο κέντρο αναφοράς από διάφορα νοσοκομεία, με διαφορετικά πρωτόκολλα σχετικά με τα κριτήρια εκτέλεσης βιοψίας FNA σε θυρεοειδή. Μετεγχειρητικές ιστολογικές επαληθεύσεις ήταν διαθέσιμες για 266 ασθενείς (κακοήθειες και μη). Το χαμηλό ποσοστό ιστολογικών επαληθεύσεων οφείλεται στην ετερογενή προέλευση των ασθενών και στην έλλειψη ολοκληρωμένης παρακολούθησης και επανελέγχου των ασθενών. 
Για την αξιολόγηση των δεδομένων χρησιμοποιήθηκαν περιγραφικά στατιστικά μεγέθη όπως, μέση τιμή, τυπική απόκλιση, ποσοστά, μέγιστο και ελάχιστο. Έγιναν επίσης και χ2 δοκιμές επιπέδου σημαντικότητας διαφόρων παραμέτρων για να ελεγχθεί η πιθανή συσχέτιση ή η ανεξαρτησία. Για τη συσχέτιση των κυτταρολογικών και των ιστολογικών διαγνώσεων και την αξιολόγηση των εργαστηριακών ευρημάτων, πέραν των περιγραφικών στατιστικών μεγεθών χρησιμοποιήθηκαν και υπολογισμοί της ευαισθησίας, της ειδικότητας, της συνολικής ακρίβειας, της αρνητικής και θετικής αξίας πρόβλεψης (negative and positive predictive value). Προκειμένου να καθοριστεί εάν μια κατηγορία ασθενειών συσχετίζεται ή όχι με συγκεκριμένες κυτταρολογικές παραμέτρους εφαρμόστηκε μέθοδος ελέγχου στατιστικής σημαντικότητας σε επίπεδο 5% (p < 0,05). Η διαδικασία ακολουθήθηκε για κάθε κατηγορία ασθενειών ή συνδυασμό τους και για κάθε παράμετρο των κυτταρολογικών και αρχιτεκτονικών στοιχείων της κυτταρολογικής διάγνωσης. Τα αποτελέσματα της στατιστικής ανάλυσης επέτρεψαν το διαχωρισμό των δεδομένων σε καλοήθη, κακοήθη, νεοπλασματικά, ύποπτα για κακοήθεια και οριακά με χαρακτηριστικά γνωρίσματα μεταξύ ενός καλοήθους και ενός νεοπλασματικού. Στην συνέχεια αναπτύχθηκε σύστημα υποστήριξης της διάγνωσης χρησιμοποιώντας εξειδικευμένες μεθόδους εξόρυξης δεδομένων. Το σύστημα αποτελείται από τέσσερις βαθμίδες. Η πρώτη βαθμίδα αυτού του συστήματος είναι το περιβάλλον Συλλογής Δεδομένων στην οποία τα δεδομένα αποθηκεύονται στη βάση δεδομένων. Η Δεύτερη Βαθμίδα αυτού του συστήματος αφορά στην Επιλογή Δεδομένων. Σύμφωνα με την καταγραφή των απαιτήσεων, την εισαγωγή και τη ψηφιοποίηση των στοιχείων, δημιουργήθηκαν 111 χαρακτηριστικά για κάθε ασθενή (record). Τα περισσότερα χαρακτηριστικά είχαν τιμές δυαδικού τύπου, αποτυπώνοντας την ύπαρξη ή μη του κάθε χαρακτηριστικού, ενώ κάποιες άλλες είχαν τιμές τύπων αριθμών ή αλφαριθμητικών χαρακτήρων. 
Από τα 111 χαρακτηριστικά επιλέχθηκαν 60 χαρακτηριστικά τα οποία περιγράφουν τη δομή των επιχρισμάτων ενώ δημιουργήθηκαν άλλα 7 χαρακτηριστικά τα οποία αφορούσαν στην ομαδοποίηση άλλων χαρακτηριστικών. Η Τρίτη Βαθμίδα του συστήματος αφορά στην εύρεση των Κατάλληλων Χαρακτηριστικών. Λόγω του αρχικά υψηλού αριθμού χαρακτηριστικών παραμέτρων (67 ανά περίπτωση), ήταν απαραίτητο να εξαλειφθούν οι χαρακτηριστικές παράμετροι που συσχετίζονταν γραμμικά ή δεν είχαν καμία διαγνωστική πληροφορία. H μέθοδος επιλογής χαρακτηριστικών εφαρμόστηκε πριν από την ταξινόμηση, με γνώμονα την ανεύρεση ενός υποσυνόλου των χαρακτηριστικών παραμέτρων που βελτιστοποιούν σε ακρίβεια τη διαδικασία ταξινόμησης. Εφαρμόστηκε η τεχνική επιπλέουσας πρόσθιας ακολουθιακά μεταβαλλόμενης επιλογής (SFFS). Ο αριθμός των δειγμάτων που χρησιμοποιήθηκαν είναι 2.036 (1.886 καλοήθειες και 150 κακοήθειες). Εξ αυτών, όλες οι κακοήθειες είναι ιστολογικά επιβεβαιωμένες. Επίσης, 140 καλοήθειες είναι ιστολογικά επιβεβαιωμένες με επάρκεια υλικού. Οι υπόλοιπες 1.726 καλοήθειες είναι επιβεβαιωμένες με συνεκτίμηση κλινικών, εργαστηριακών και απεικονιστικών ιατρικών εξετάσεων (υπέρηχοι κ.λπ.). Από τα 2.036 δείγματα, το 25% χρησιμοποιήθηκε για την επιλογή χαρακτηριστικών παραμέτρων, δηλαδή 37 περιπτώσεις κακοήθειας (Malignant) και 472 περιπτώσεις καλοήθειας (Non Malignant). Από την εφαρμογή της τεχνικής (SFFS) επιλέχθηκαν τελικά 12 χαρακτηριστικά ως βέλτιστα για την ταξινόμηση των δεδομένων FNA σε καλοήθη και κακοήθη. Η Τέταρτη βαθμίδα επεξεργασίας είναι η Εφαρμογής Ταξινόμησης με χρήση Μεθόδων Εξόρυξης Δεδομένων ή Ταξινομητής. Για το σκοπό αυτό, επιλέχθηκε να εφαρμοστεί μια πληθώρα αξιόπιστων, καλά επιβεβαιωμένων και σύγχρονων μεθόδων εξόρυξης δεδομένων. Το σύστημα εκπαιδεύτηκε και ελέγχθηκε σε σχέση με το δείγμα για το οποίο είχαμε ιστολογικές επιβεβαιώσεις (ground truth). 
Η ανεξάρτητη εφαρμογή τεσσάρων αξιόπιστων μεθόδων, Δέντρων Αποφάσεων (Decision Trees), Τεχνητών Νευρωνικών Δικτύων (Artificial Neural Network), Μηχανών Στήριξης Διανυσμάτων (Support Vector Machine), και Κ - κοντινότερου γείτονα (k-NN), έδωσε αποτελέσματα συγκρίσιμα με αυτά της FNA μεθόδου. Περαιτέρω βελτίωση των αποτελεσμάτων επιτεύχθηκε με την εφαρμογή της μεθόδου πλειοψηφικού κανόνα (Majority Vote - CMV) συνδυάζοντας τα αποτελέσματα από την εφαρμογή των τριών καλύτερων αλγορίθμων, ήτοι των Νευρωνικών Δικτύων, Μηχανών Στήριξης Διανυσμάτων και Κ - κοντινότερου γείτονα. Η τροποποιημένη μέθοδος τεχνητών αυτοάνοσων συστημάτων (Artificial Immune Systems – AIS) χρησιμοποιήθηκε για πρώτη φορά στην ταξινόμηση και παρουσίασε ιδιαίτερα βελτιωμένα αποτελέσματα στην ταξινόμηση των επιχρισμάτων τα οποία χαρακτηρίζονται ύποπτα (suspicious) από τους ειδικούς και αποτελούν το αδύναμο σημείο της μεθόδου FNA. Αυτές οι περιπτώσεις υπόνοιας αποτελούν ένα πολύ δύσκολο κομμάτι για τη διάκριση μεταξύ των καλοηθειών και των κακοηθειών, ακόμα και για τους πλέον ειδικούς. Επειδή όλα τα περιστατικά που χαρακτηρίζονται από την βιοψία FNA ως υπόνοιες αντιμετωπίζονται κλινικά σαν κακοήθειες, η εφαρμογή των αλγοριθμικών μεθόδων βελτιώνει αισθητά τη διαχείριση αυτών των περιπτώσεων μειώνοντας τον αριθμό των άσκοπων χειρουργικών επεμβάσεων θυρεοειδεκτομών. / The aim of the present thesis is the development of an integrated decision support system (DSS) that uses data mining methods to classify fine needle aspiration (FNA) biopsy smears. Two categories were selected for the FNA smears: malignant and non-malignant. The system consists of the following stages: 1) data collection, 2) data selection, 3) selection of suitable features, 4) application of data mining methods for the classification of FNA biopsy smears. 
Furthermore, a fundamental objective of the doctoral thesis was to improve the classification of suspicious smears, for which FNA biopsy has a known limitation. The system was trained and tested on the sample for which histological confirmation existed (ground truth). For smears characterized as non-malignant by FNA for which histological data were not available, complementary clinical, laboratory and imaging evaluations were taken into account in order to create the sample. The smears used in this thesis were collected from thyroid FNA biopsies at the Pathologoanatomy Laboratory, A’ Pathology Department, Medical School of Athens University. Given that this laboratory is a reference center, a significant number of FNA smears were sent to it from other laboratories for cross-checking. The examination files were sorted in chronological order, but they were in paper form. The requirements for the structure and design of the database system were collected. Based on the material of the documented diagnoses, an advanced system was designed and developed for data initialization and coding. The database was developed based on the design and requirements analysis, and the data were stored in it for further processing. The requirements analysis and design of the graphical user interface for data entry were carried out in parallel with the database design. Taking into account that the system might be used beyond the scope of the thesis, the graphical user interface was designed to be a user-friendly and flexible environment. According to the methodological approach that was followed, the cytological characteristics of 9,102 FNA smears aspirated between 2000 and 2004 were analyzed statistically. The cytological reports were cross-correlated with the histological diagnoses, aiming to calculate the effect or contribution of each cytological characteristic to a false or true cytological diagnosis and to find the possible sources of erroneous diagnosis. 
Smears that contained only blood or very few follicular cells without colloid were characterized as insufficient for diagnosis. The aspiration was performed either in the A’ department of Athens University (most of the cases with palpable nodules) or elsewhere (mainly under the guidance of the reference center). The smears were sent to the reference center from various hospitals with different protocols concerning the criteria for performing a thyroid FNA. Histological reports were available for 266 patients. The small number of histological verifications was due to the heterogeneous origin of the patients and the lack of complete patient follow-up. To evaluate the data, descriptive statistics were used, such as mean, standard deviation, percentages, maximum and minimum. In addition, χ2 tests of significance were performed to check for possible correlation or independence. For correlating cytological and histological diagnoses and evaluating laboratory findings, sensitivity, specificity, overall accuracy, negative predictive value and positive predictive value were calculated in addition to the descriptive statistics. A test of statistical significance at the 5% level (p < 0.05) was applied to determine whether a disease category was correlated with a cytological parameter. These tests were performed for each disease category, or combination of categories, and for each cytological parameter. The statistical analysis divided the smears into non-malignant, malignant, neoplastic, suspicious for malignancy and borderline. A diagnosis support system was then implemented using data mining methods. The system consists of four stages. The first stage is the data collection environment, which stores the data in the database. The second stage concerns the selection of data. The requirements analysis concluded that 111 characteristics are needed to describe each patient (record). Most of them have binary values, indicating presence or absence; others have alphanumeric or numeric values. 
Among them, 60 were selected and 7 more were produced by grouping other characteristics. The final analysis shows that these 67 characteristics are sufficient to describe the structure of the smears in general. The third stage of the system concerns the selection of the best characteristics. Due to the high number of attributes (67 per case), it was essential to eliminate the characteristics that are linearly correlated or carry no diagnostic information. Feature selection was applied before classification, with the aim of finding a subset of characteristics that optimizes the accuracy of the classification process. The Sequential Floating Forward Selection (SFFS) technique was applied. The number of samples used was 2,036 (1,886 non-malignant and 150 malignant). All of the malignancies were histologically confirmed. In addition, 140 non-malignant cases were histologically confirmed, and the remaining non-malignant cases were confirmed by the combined evaluation of clinical, laboratory and imaging examinations (ultrasound etc.). Of the 2,036 smears, 25% were used for feature selection: 37 malignant and 472 non-malignant smears. The SFFS technique selected 12 characteristics as the best for classifying the FNA data as non-malignant or malignant. The fourth stage is the application of classification using data mining methods, in other words the classifier. For this purpose, a set of reliable, well-established and modern data mining methods was applied. The system was trained and tested using the sample with histological verification (ground truth). The independent application of four reliable methods, Decision Trees, Artificial Neural Networks, Support Vector Machines, and k-NN, gave results comparable to those of FNA. Further improvement was achieved by applying the majority rule (Majority Vote - CMV) to combine the results of the three best algorithms: Artificial Neural Networks, Support Vector Machines, and k-NN. 
The modified Artificial Immune System (AIS) method was applied to this classification for the first time. AIS shows particularly improved results in classifying the smears characterized as “suspicious” by the experts, which are a known weakness of the FNA method. These cases are very difficult to separate into non-malignant and malignant, even for a specialist. Since all cases characterized as suspicious by FNA biopsy are treated clinically as malignancies, applying the algorithmic methods noticeably improves the management of these cases by reducing the number of unnecessary thyroidectomies.
Εξόρυξη δεδομένων, Δέντρα αποφάσεων, 610.285, Medical decision support system, Data mining, FNA biopsy, Neural networks, Decision trees, k-Nearest neighborhood, Immune systems, Majority vote, Support vector systems, Feature selection
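The majority-vote combination and the sensitivity and specificity measures mentioned in this abstract can be sketched in a few lines. The function names and the toy labels below are invented for illustration (1 = malignant, 0 = non-malignant); they are not the thesis's data, and the actual classifiers (neural network, SVM, k-NN) are replaced here by fixed prediction lists.

```python
def majority_vote(predictions):
    """Combine per-classifier 0/1 label lists into one label per sample.

    predictions is a list of equally long lists, one per classifier.
    A sample is labeled 1 when more than half of the classifiers say 1.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(1 if 2 * sum(votes) > len(votes) else 0)
    return combined


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels with 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0]            # hypothetical ground truth
    ann = [1, 0, 0, 0, 1]               # hypothetical classifier outputs
    svm = [1, 1, 0, 1, 0]
    knn = [0, 1, 0, 0, 0]
    voted = majority_vote([ann, svm, knn])
    print(voted, sensitivity_specificity(y_true, voted))
```

In this toy case each classifier makes a different mistake, so the vote corrects all of them; that is the situation in which combining classifiers helps most.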
