Global ETD Search

111	Human mobility behavior : Transport mode detection by GPS data Sadeghian, Paria January 2021 (has links) GPS tracking data are widely used to understand human travel behavior and to evaluate the impact of travel. A major advantage with the usage of GPS tracking devices for collecting data is that it enables the researcher to collect large amounts of highly accurate and detailed human mobility data. However, unlabeled GPS tracking data does not easily lend itself to detecting transportation mode and this has given rise to a range of methods and algorithms for this purpose. The algorithms used vary in design and functionality, from defining specific rules to advanced machine learning algorithms. There is however no previous comprehensive review of these algorithms and this thesis aims to identify their essential features and methods and to develop and demonstrate a method for the detection of transport mode in GPS tracking data. To do this, it is necessary to have a detailed description of the particular journey undertaken by an individual. Therefore, as part of the investigation, a microdata analytic approach is applied to the problem areas, including the stages of data collection, data processing, analyzing the data, and decision making. In order to fill the research gap, Paper I consists of a systematic literature review of the methods and essential features used for detecting the transport mode in unlabeled GPS tracking data. Selected empirical studies were categorized into rule-based methods, statistical methods, and machine learning methods. The evaluation shows that machine learning algorithms are the most common. In the evaluation, I compared the methods previously used, extracted features, types of dataset, and model accuracy of transport mode detection. The results show that there is no standard method used in transport mode detection. 
In light of these results, I propose in Paper II a stepwise methodology to detect five transport modes in unlabeled GPS data. The methodology first uses an unsupervised algorithm to detect the five transport modes. A GIS multi-criteria process was applied to label part of the dataset. The performance of five supervised algorithms was evaluated by applying them to different portions of the labeled dataset. The results show that the stepwise methodology can achieve high accuracy in detecting the transport mode when only 10% of the entire dataset is labeled. One interesting direction for future work would be to apply the stepwise methodology to a larger, balanced dataset. A semi-supervised deep-learning approach is also suggested for further development in transport mode detection, since such a method can detect transport modes with only small amounts of labeled data. The stepwise methodology can thus be improved upon in further studies. Transport mode detection Machine learning Statistical learning Rule-based method Data labeling Transport Systems and Logistics Transportteknik och logistik Computer Sciences Datavetenskap (datalogi)
112	Least Squares in Sampling Complexity and Statistical Learning Bartel, Felix 19 January 2024 (has links) Data gathering is a constant in human history, with ever-increasing quantity and dimensionality. To get a feel for the data, make it interpretable, or find underlying laws, it is necessary to fit a function to the finite and possibly noisy data. In this thesis we focus on one method for achieving this, namely least squares approximation. Its discovery dates back to around 1800, and it has since proven to be an indispensable tool that is efficient and can achieve optimal error when used correctly. Crucial for the least squares method are the ansatz functions and the sampling points. To discuss them, we gather tools from probability theory, frame subsampling, and $L_2$-Marcinkiewicz-Zygmund inequalities. With these we give results in the worst-case or minmax setting, where a set of points is sought for approximating a class of functions, which we model as a generic reproducing kernel Hilbert space. Further, we give error bounds in the statistical learning setting for approximating individual functions from possibly noisy samples. Here, we include the covariate-shift setting as a subfield of transfer learning. A parameter choice question naturally arises for balancing over- and underfitting effects. We tackle this using the cross-validation score, for which we show a fast way of computing it and prove its goodness. Contents: 1 Introduction; 2 Least squares approximation; 3 Reproducing kernel Hilbert spaces (RKHS); 4 Concentration inequalities; 5 Subsampling of finite frames; 6 $L_2$-Marcinkiewicz-Zygmund (MZ) inequalities; 7 Least squares in the worst-case setting; 8 Least squares in statistical learning; 9 Cross-validation; 10 Outlook. info:eu-repo/classification/ddc/518 ddc:518
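As an illustration of the cross-validation idea mentioned in this abstract, the following minimal sketch fits a straight line by least squares and computes the leave-one-out cross-validation score in two ways: by refitting n times, and by a single fit using the standard leverage (hat-matrix) identity for linear least squares. This is not taken from the thesis, whose fast computation may differ; the function names and the toy data `XS`, `YS` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def loo_cv_naive(xs, ys):
    """Leave-one-out score obtained by refitting once per held-out point."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / n


def loo_cv_fast(xs, ys):
    """Same score from a single fit, using residual_i / (1 - leverage_i)."""
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a, b = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_mean) ** 2 / sxx
        total += ((y - (a + b * x)) / (1.0 - leverage)) ** 2
    return total / n


# Hypothetical noisy samples of a roughly linear function.
XS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
YS = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2]

if __name__ == "__main__":
    print(loo_cv_naive(XS, YS), loo_cv_fast(XS, YS))
```

Both functions return the same value up to rounding; the second needs only one fit, which is the kind of saving that makes cross-validation practical for parameter choice.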
113	Caracterisation des suspensions par des methodes optiques. modelisation par reseaux de neurones / Characterization of suspensions using optical methods. neural networks modeling. Bongono, Juilien 03 September 2010 (has links) La sédimentation des suspensions aqueuses de particules minérales microniques, polydisperses et concentrées a été analysée à l’aide du Turbiscan MA 2000 fondé sur la diffusion multiple de la lumière, en vue d’établir la procédure qui permet de déceler la présence d’une morphologie fractale, puis de déduire les règles de comportements des suspensions fractales par la modélisation avec les réseaux de neurones. Le domaine des interactions interparticulaires physicochimiques (0 à 10% volumique en solide) a été privilégié. La méthodologie de détermination de la structure multifractale des agglomérats et de la suspension a été proposée. La modification structurale des agglomérats qui est à l’origine de comportements non linéaires des suspensions et qui dépend des propriétés cohésives des particules primaires, est interprétée par la variation de la mobilité électrophorétique des particules en suspension. Une approche d’estimation de ces modifications structurales par les réseaux de neurones, à travers la dimension fractale, a été présentée. Les limites du modèle à assimiler ces comportements particuliers ont été expliquées comme résultant du faible nombre d’exemples et de la grande variabilité des mesures aux faibles fractions volumiques en solide. / The sedimentation of concentrated, polydisperse aqueous suspensions of micron-sized mineral particles was analyzed using the Turbiscan MA 2000, which is based on multiple light scattering, in order to establish a procedure for detecting the presence of a fractal morphology, and then to deduce the laws governing the behavior of fractal suspensions by modeling with neural networks. The study focused on the regime of physicochemical interparticle interactions (0 to 10% solid by volume). A methodology for determining the multifractal structure of the agglomerates and of the suspension is proposed. 
The structural modification of the agglomerates, which is at the origin of the nonlinear behavior of the suspensions and which depends on the cohesive properties of the primary particles, is interpreted through the change in the electrophoretic mobility of the suspended particles. An approach for estimating these structural changes with neural networks, through the fractal dimension, is presented. The limits of the model's ability to learn these specific behaviors are explained as resulting from the small number of examples and the large variability of the measurements at low solid volume fractions. Dimension fractale Diamètre des particules Suspensions Agglomérats Particules cohésives Sédimentation Diffusion multiple de la lumière Réseau de neurones Apprentissage statistique Modélisation non linéaire Fractal dimension Particles diameter Suspensions Agglomerates Cohesive particles Sedimentation Multiple light scattering Neural networks Statistical learning Nonlinear model
114	Méthodes variationnelles pour la segmentation d'images à partir de modèles : applications en imagerie médicale / Variational methods for model-based image segmentation - applications in medical imaging Prevost, Raphaël 21 October 2013 (has links) La segmentation d’images médicales est depuis longtemps un sujet de recherche actif. Cette thèse traite des méthodes de segmentation basées modèles, qui sont un bon compromis entre généricité et capacité d’utilisation d’informations a priori sur l’organe cible. Notre but est de construire un algorithme de segmentation pouvant tirer profit d’une grande variété d’informations extérieures telles que des bases de données annotées (via l’apprentissage statistique), d’autres images du même patient (via la co-segmentation) et des interactions de l’utilisateur. Ce travail est basé sur la déformation de modèle implicite, une méthode variationnelle reposant sur une représentation implicite des formes. Après avoir amélioré sa formulation mathématique, nous montrons son potentiel sur des problèmes cliniques difficiles. Nous introduisons ensuite différentes généralisations, indépendantes mais complémentaires, visant à enrichir le modèle de forme et d’apparence utilisé. La diversité des applications cliniques traitées prouve la généricité et l’efficacité de nos contributions. / Within the wide field of medical imaging research, image segmentation is one of the earliest but still open topics. This thesis focuses on model-based segmentation methods, which achieve a good trade-off between genericity and ability to carry prior information on the target organ. Our goal is to build an efficient segmentation framework that is able to leverage all kinds of external information, i.e. annotated databases via statistical learning, other images from the patient via co-segmentation and user input via live interactions. 
This work is based on the implicit template deformation framework, a variational method relying on an implicit representation of shapes. After improving the mathematical formulation of this approach, we show its potential on challenging clinical problems. Then, we introduce different generalizations, all independent but complementary, aimed at enriching both the shape and appearance model exploited. The diversity of the clinical applications addressed shows the genericity and the effectiveness of our contributions. Segmentation d'image Imagerie médicale Méthodes basées modèles Méthodes variationnelles Apprentissage statistique Rein Myocarde Échographie Échographie de contraste Image segmentation Medical imaging Model-based methods Variational methods Statistical learning Kidney Myocardium Ultrasound Contrast-enhanced ultrasound
115	Modélisation statistique de la mortalité maternelle et néonatale pour l'aide à la planification et à la gestion des services de santé en Afrique Sub-Saharienne / Statistical modeling of maternal and neonatal mortality for help in planning and management of health services in sub-Saharan Africa Ndour, Cheikh 19 May 2014 (has links) L'objectif de cette thèse est de proposer une méthodologie statistique permettant de formuler une règle de classement capable de surmonter les difficultés qui se présentent dans le traitement des données lorsque la distribution a priori de la variable réponse est déséquilibrée. Notre proposition est construite autour d'un ensemble particulier de règles d'association appelées "class association rules". Dans le chapitre II, nous avons exposé les bases théoriques qui sous-tendent la méthode. Nous avons utilisé les indicateurs de performance usuels existant dans la littérature pour évaluer un classifieur. A chaque règle "class association rule" est associée un classifieur faible engendré par l'antécédent de la règle que nous appelons profils. L'idée de la méthode est alors de combiner un nombre réduit de classifieurs faibles pour constituer une règle de classement performante. Dans le chapitre III, nous avons développé les différentes étapes de la procédure d'apprentissage statistique lorsque les observations sont indépendantes et identiquement distribuées. On distingue trois grandes étapes: (1) une étape de génération d'un ensemble initial de profils, (2) une étape d'élagage de profils redondants et (3) une étape de sélection d'un ensemble optimal de profils. Pour la première étape, nous avons utilisé l'algorithme "apriori" reconnu comme l'un des algorithmes de base pour l'exploration des règles d'association. Pour la deuxième étape, nous avons proposé un test stochastique. 
Et pour la dernière étape un test asymptotique est effectué sur le rapport des valeurs prédictives positives des classifieurs lorsque les profils générateurs respectifs sont emboîtés. Il en résulte un ensemble réduit et optimal de profils dont la combinaison produit une règle de classement performante. Dans le chapitre IV, nous avons proposé une extension de la méthode d'apprentissage statistique lorsque les observations ne sont pas identiquement distribuées. Il s'agit précisément d'adapter la procédure de sélection de l'ensemble optimal lorsque les données ne sont pas identiquement distribuées. L'idée générale consiste à faire une estimation bayésienne de toutes les valeurs prédictives positives des classifieurs faibles. Par la suite, à l'aide du facteur de Bayes, on effectue un test d'hypothèse sur le rapport des valeurs prédictives positives lorsque les profils sont emboîtés. Dans le chapitre V, nous avons appliqué la méthodologie mise en place dans les chapitres précédents aux données du projet QUARITE concernant la mortalité maternelle au Sénégal et au Mali. / The aim of this thesis is to design a supervised statistical learning methodology that can overcome the weakness of standard methods when the prior distribution of the response variable is unbalanced. The proposed methodology is built using class association rules. Chapter II presents the theoretical basis of the statistical learning method by relating various classifier performance metrics to class association rules. Since the classifier corresponding to a class association rule is a weak classifier, we propose to select a small number of such weak classifiers and to combine them in order to build an efficient classifier. In Chapter III, we develop the different steps of the statistical learning method when observations are independent and identically distributed. 
There are three main steps. In the first step, an initial set of patterns correlated with the target class is generated using the "apriori" algorithm. In the second step, we propose a hypothesis test to prune redundant patterns. In the third step, a hypothesis test is performed on the ratio of the positive predictive values of the classifiers when their respective generating patterns are nested. This results in a reduced and optimal set of patterns whose combination provides an efficient classifier. In Chapter IV, we extend the classification method proposed in Chapter III to handle the case where observations are not identically distributed. The aim here is to adapt the procedure for selecting the optimal set of patterns when the data are grouped. In this setting, we estimate the positive predictive values as the mean of the posterior distribution of the target class probability, using an empirical Bayes method. Thereafter, using the Bayes factor, a hypothesis test based on the ratio of the positive predictive values is carried out when patterns are nested. Chapter V is devoted to applying the proposed methodology to a real-world dataset. We studied the QUARITE project dataset on maternal mortality in Senegal and Mali in order to provide a decision-making tree that health care professionals can refer to when managing patients delivering in their health facilities. Apprentissage statistique Classement Données déséquilibrées Estimation Bayésienne empirique Mortalité maternelle Profils Règles d'association Sélection de profils Test d'hypothèse Association rules Classification Empirical Bayesian estimation Hypothesis testing Maternal mortality Patterns Selection patterns Statistical learning Unbalanced data
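To make the quantities in this abstract concrete, the sketch below computes the positive predictive value (PPV) of a pattern, treated as the antecedent of a class association rule, and the PPV ratio of two nested patterns. It is only an illustration under stated assumptions: the records, item names and labels are hypothetical toy data, and the thesis applies a formal hypothesis test to this ratio, whereas the sketch stops at the point estimate.

```python
def ppv(records, pattern, target):
    """Share of records matching `pattern` whose label is `target`.

    `records` is a list of (set_of_items, label) pairs and `pattern` is a
    set of items; a record matches when it contains every item of the pattern.
    Returns None when no record matches.
    """
    matched = [label for items, label in records if pattern <= items]
    if not matched:
        return None
    return sum(1 for label in matched if label == target) / len(matched)


def ppv_ratio(records, specific, general, target):
    """PPV of the more specific pattern divided by PPV of the nested general one."""
    if not general <= specific:
        raise ValueError("patterns are not nested")
    return ppv(records, specific, target) / ppv(records, general, target)


# Hypothetical unbalanced toy data: 3 positives out of 10 records.
RECORDS = [
    ({"a", "b"}, "pos"), ({"a", "b"}, "pos"), ({"a", "b"}, "neg"),
    ({"a"}, "pos"), ({"a"}, "neg"), ({"a"}, "neg"),
    ({"a"}, "neg"), ({"a"}, "neg"),
    ({"b"}, "neg"), (set(), "neg"),
]

if __name__ == "__main__":
    print(ppv(RECORDS, {"a"}, "pos"))
    print(ppv(RECORDS, {"a", "b"}, "pos"))
    print(ppv_ratio(RECORDS, {"a", "b"}, {"a"}, "pos"))
```

A ratio clearly above 1 suggests that the more specific pattern is worth keeping alongside, or instead of, the general one; deciding whether the difference is significant is the role of the tests described in the abstract.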
116	L'approche Support Vector Machines (SVM) pour le traitement des données fonctionnelles / Support Vector Machines (SVM) for Fonctional Data Analysis Henchiri, Yousri 16 October 2013 (has links) L'Analyse des Données Fonctionnelles est un domaine important et dynamique en statistique. Elle offre des outils efficaces et propose de nouveaux développements méthodologiques et théoriques en présence de données de type fonctionnel (fonctions, courbes, surfaces, ...). Le travail exposé dans cette thèse apporte une nouvelle contribution aux thèmes de l'apprentissage statistique et des quantiles conditionnels lorsque les données sont assimilables à des fonctions. Une attention particulière a été réservée à l'utilisation de la technique Support Vector Machines (SVM). Cette technique fait intervenir la notion d'Espace de Hilbert à Noyau Reproduisant. Dans ce cadre, l'objectif principal est d'étendre cette technique non-paramétrique d'estimation aux modèles conditionnels où les données sont fonctionnelles. Nous avons étudié les aspects théoriques et le comportement pratique de la technique présentée et adaptée sur les modèles de régression suivants. Le premier modèle est le modèle fonctionnel de quantiles de régression quand la variable réponse est réelle, les variables explicatives sont à valeurs dans un espace fonctionnel de dimension infinie et les observations sont i.i.d.. Le deuxième modèle est le modèle additif fonctionnel de quantiles de régression où la variable d'intérêt réelle dépend d'un vecteur de variables explicatives fonctionnelles. Le dernier modèle est le modèle fonctionnel de quantiles de régression quand les observations sont dépendantes. Nous avons obtenu des résultats sur la consistance et les vitesses de convergence des estimateurs dans ces modèles. Des simulations ont été effectuées afin d'évaluer la performance des procédures d'inférence. Des applications sur des jeux de données réelles ont été considérées. 
Le bon comportement de l'estimateur SVM est ainsi mis en évidence. / Functional Data Analysis is an important and dynamic area of statistics. It offers effective new tools and proposes new methodological and theoretical developments in the presence of functional data (functions, curves, surfaces, ...). The work outlined in this dissertation provides a new contribution to the themes of statistical learning and quantile regression when data can be considered as functions. Special attention is devoted to the Support Vector Machines (SVM) technique, which involves the notion of a Reproducing Kernel Hilbert Space. In this context, the main goal is to extend this nonparametric estimation technique to conditional models that take functional data into account. We investigated the theoretical aspects and the practical behavior of the proposed and adapted technique on the following regression models. The first model is the conditional quantile functional model when the covariate takes its values in a bounded subspace of an infinite-dimensional functional space, the response variable takes its values in a compact subset of the real line, and the observations are i.i.d. The second model is the functional additive quantile regression model, where the response variable depends on a vector of functional covariates. The last model is the conditional quantile functional model in the dependent functional data case. We obtained the weak consistency and a convergence rate of these estimators. Simulation studies are performed to evaluate the performance of the inference procedures. Applications to chemometrics, environmental and climatic data analysis are considered. The good behavior of the SVM estimator is thus highlighted. 
Analyse des Données Fonctionnelles Support Vector Machines Quantiles de régression Apprentissage statistique Apprentissage supervisé Espace de Hilbert à noyau reproduisant Functional Data Analysis Support Vector Machines Quantile Regression Statistical learning Supervised learning Reproducing kernel Hilbert space
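Quantile regression, the subject of this abstract, is commonly formulated as minimizing the "pinball" (check) loss. The sketch below shows that loss and verifies on a toy sample that the constant minimizing its empirical average is a sample quantile. This is a deliberately reduced illustration: the thesis minimizes a regularized risk over a reproducing kernel Hilbert space with functional covariates, while this sketch has no covariates at all, and the function names and data `YS` are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Check loss for quantile level tau in (0, 1): asymmetric absolute error."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def empirical_risk(ys, q, tau):
    """Average pinball loss of the constant prediction q over the sample."""
    return sum(pinball_loss(y, q, tau) for y in ys) / len(ys)


def best_constant(ys, tau):
    """Constant minimizing the empirical pinball risk.

    The risk is piecewise linear and convex in q, so a minimizer is always
    attained at one of the sample points; searching over them is enough.
    """
    return min(ys, key=lambda q: empirical_risk(ys, q, tau))


# Hypothetical sample with one large outlier.
YS = [1.0, 2.0, 3.0, 4.0, 100.0]

if __name__ == "__main__":
    print(best_constant(YS, 0.5))
    print(best_constant(YS, 0.9))
```

With `tau = 0.5` the loss is half the absolute error, so the minimizer is the median and is unaffected by the outlier; raising `tau` penalizes under-prediction more heavily and moves the minimizer toward the upper tail.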
117	A Unified View of Local Learning : Theory and Algorithms for Enhancing Linear Models / Une Vue Unifiée de l'Apprentissage Local : Théorie et Algorithmes pour l'Amélioration de Modèles Linéaires Zantedeschi, Valentina 18 December 2018 (has links) Dans le domaine de l'apprentissage machine, les caractéristiques des données varient généralement dans l'espace des entrées : la distribution globale pourrait être multimodale et contenir des non-linéarités. Afin d'obtenir de bonnes performances, l'algorithme d'apprentissage devrait alors être capable de capturer et de s'adapter à ces changements. Même si les modèles linéaires ne parviennent pas à décrire des distributions complexes, ils sont réputés pour leur passage à l'échelle, en entraînement et en test, aux grands ensembles de données en termes de nombre d'exemples et de nombre de fonctionnalités. Plusieurs méthodes ont été proposées pour tirer parti du passage à l'échelle et de la simplicité des hypothèses linéaires afin de construire des modèles aux grandes capacités discriminatoires. Ces méthodes améliorent les modèles linéaires, dans le sens où elles renforcent leur expressivité grâce à différentes techniques. Cette thèse porte sur l'amélioration des approches d'apprentissage locales, une famille de techniques qui infère des modèles en capturant les caractéristiques locales de l'espace dans lequel les observations sont intégrées.L'hypothèse fondatrice de ces techniques est que le modèle appris doit se comporter de manière cohérente sur des exemples qui sont proches, ce qui implique que ses résultats doivent aussi changer de façon continue dans l'espace des entrées. La localité peut être définie sur la base de critères spatiaux (par exemple, la proximité en fonction d'une métrique choisie) ou d'autres relations fournies, telles que l'association à la même catégorie d'exemples ou un attribut commun. 
On sait que les approches locales d'apprentissage sont efficaces pour capturer des distributions complexes de données, évitant de recourir à la sélection d'un modèle spécifique pour la tâche. Cependant, les techniques de pointe souffrent de trois inconvénients majeurs : elles mémorisent facilement l'ensemble d'entraînement, ce qui se traduit par des performances médiocres sur de nouvelles données ; leurs prédictions manquent de continuité dans des endroits particuliers de l'espace ; elles évoluent mal avec la taille des ensembles des données. Les contributions de cette thèse examinent les problèmes susmentionnés dans deux directions : nous proposons d'introduire des informations secondaires dans la formulation du problème pour renforcer la continuité de la prédiction et atténuer le phénomène de la mémorisation ; nous fournissons une nouvelle représentation de l'ensemble de données qui tient compte de ses spécificités locales et améliore son évolutivité. Des études approfondies sont menées pour mettre en évidence l'efficacité de ces contributions pour confirmer le bien-fondé de leurs intuitions. Nous étudions empiriquement les performances des méthodes proposées tant sur des jeux de données synthétiques que sur des tâches réelles, en termes de précision et de temps d'exécution, et les comparons aux résultats de l'état de l'art. Nous analysons également nos approches d'un point de vue théorique, en étudiant leurs complexités de calcul et de mémoire et en dérivant des bornes de généralisation serrées. / In the field of machine learning, data characteristics usually vary over the input space: the overall distribution might be multi-modal and contain non-linearities. In order to achieve good performance, the learning algorithm should then be able to capture and adapt to these changes. 
Even though linear models fail to describe complex distributions, they are renowned for their scalability, at both training and test time, to datasets that are large in terms of the number of examples and the number of features. Several methods have been proposed to take advantage of the scalability and simplicity of linear hypotheses to build models with great discriminatory capabilities. These methods empower linear models, in the sense that they enhance their expressive power through different techniques. This dissertation focuses on enhancing local learning approaches, a family of techniques that infer models by capturing the local characteristics of the space in which the observations are embedded. The founding assumption of these techniques is that the learned model should behave consistently on examples that are close, implying that its results should also change smoothly over the space. Locality can be defined by spatial criteria (e.g. closeness according to a selected metric) or by other provided relations, such as association with the same category of examples or a shared attribute. Local learning approaches are known to be effective in capturing complex distributions of the data, avoiding the need to select a model specific to the task. However, state-of-the-art techniques suffer from three major drawbacks: they easily memorize the training set, resulting in poor performance on unseen data; their predictions lack smoothness in particular locations of the space; and they scale poorly with the size of the datasets. The contributions of this dissertation address the aforementioned pitfalls in two directions: we propose to introduce side information in the problem formulation to enforce smoothness in prediction and attenuate the memorization phenomenon; and we provide a new representation of the dataset which takes into account its local specificities and improves scalability. 
Thorough studies are conducted to highlight the effectiveness of these contributions and to confirm the soundness of the underlying intuitions. We empirically study the performance of the proposed methods on both toy and real tasks, in terms of accuracy and execution time, and compare it to state-of-the-art results. We also analyze our approaches from a theoretical standpoint, by studying their computational and memory complexities and by deriving tight generalization bounds. Apprentissage Machine Algorithme d'apprentissage Apprentissage local Apprentissage décentralisé Apprentissage métrique Garanties de généralisation Apprentissage multi-vues Machine Learning Statistical Learning Local Learning Decentralized Learning Metric Learning Generalization Guarantees Multi-view Learning Landmarks
118	Statistical modeling of protein sequences beyond structural prediction : high dimensional inference with correlated data / Modélisation statistique des séquences de protéines au-delà de la prédiction structurelle : inférence en haute dimension avec des données corrélées Coucke, Alice 10 October 2016 (has links) Grâce aux progrès des techniques de séquençage, les bases de données génomiques ont connu une croissance exponentielle depuis la fin des années 1990. Un grand nombre d'outils statistiques ont été développés à l'interface entre bioinformatique, apprentissage automatique et physique statistique, dans le but d'extraire de l'information de ce déluge de données. Plusieurs approches de physique statistique ont été récemment introduites dans le contexte précis de la modélisation de séquences de protéines, dont l'analyse en couplages directs. Cette méthode d'inférence statistique globale fondée sur le principe d'entropie maximale, s'est récemment montrée d'une efficacité redoutable pour prédire la structure tridimensionnelle de protéines, à partir de considérations purement statistiques. Dans cette thèse, nous présentons les méthodes d'inférence en question, et encouragés par leur succès, explorons d'autres domaines complexes dans lesquels elles pourraient être appliquées, comme la détection d'homologies. Contrairement à la prédiction des contacts entre résidus qui se limite à une information topologique sur le réseau d'interactions, ces nouveaux champs d'application exigent des considérations énergétiques globales et donc un modèle plus quantitatif et détaillé. À travers une étude approfondie sur des données artificielles et biologiques, nous proposons une meilleure interprétation des paramètres centraux de ces méthodes d'inférence, jusqu'ici mal compris, notamment dans le cas d'un échantillonnage limité. 
Enfin, nous présentons une nouvelle procédure plus précise d'inférence de modèles génératifs, qui mène à des avancées importantes pour des données réelles en quantité limitée. / Over the last decades, genomic databases have grown exponentially in size thanks to the constant progress of modern DNA sequencing. A large variety of statistical tools have been developed, at the interface between bioinformatics, machine learning, and statistical physics, to extract information from these ever-increasing datasets. In the specific context of protein sequence data, several approaches have recently been introduced by statistical physicists, such as direct-coupling analysis, a global statistical inference method based on the maximum-entropy principle that has proven to be extremely effective in predicting the three-dimensional structure of proteins from purely statistical considerations. In this dissertation, we review the relevant inference methods and, encouraged by their success, discuss their extension to other challenging fields, such as sequence folding prediction and homology detection. Contrary to residue-residue contact prediction, which relies on intrinsically topological information about the network of interactions, these fields require global energetic considerations and therefore a more quantitative and detailed model. Through an extensive study on both artificial and biological data, we provide a better interpretation of the central inferred parameters, up to now poorly understood, especially in the limited sampling regime. Finally, we present a new and more precise procedure for the inference of generative models, which leads to further improvements on real, finitely sampled data. 
Inférence Apprentissage statistique Régularisation Entropie maximale Coévolution des protéines Vraisemblance maximale Champ moyen Pseudo vraisemblance Développement en grappe Inference Statistical learning Regularization Maximum entropy Protein coevolution Maximum likelihood Mean field Pseudolikelihood Cluster expansion 530.13
119	Regularization in reinforcement learning Farahmand, Amir-massoud Unknown Date No description available. Reinforcement Learning Machine Learning Statistical Learning Theory Sequential Decision-Making Problems Regularization Approximate Value/Policy Iteration Model Selection Regularized Least-Squares Regression Regularized Policy Iteration Regularized Fitted Q-Iteration Regularized LSTD Error Propagation
120	Μεθοδολογία στατιστικής μάθησης για την πρόγνωση ασθενών με τη Β-χρόνια λεμφογενή λευχαιμία (Β-ΧΛΛ) με χρήση δεδομένων κυτταρομετρίας ροής / Statistical learning methodology for the prognosis of B-chronic lymphocytic leukemia (B-CLL) using flow cytometry data Λακουμέντας, Ιωάννης 20 April 2011 (has links) B-chronic lymphocytic leukemia (B-CLL) is the most common type of leukemia in the Western world. Its prognosis remains one of the most interesting decision problems in clinical research and practice. Various clinical and laboratory factors are known to be associated with the progression of the disease. However, whether the parameters obtained by flow cytometry analysis, which are the cornerstone of the diagnostic procedure, offer additional prognostic information remains an open question. 
In this dissertation, we propose a decision support system for hematologists that provides a multiparametric prognosis for B-CLL patients by combining diverse heterogeneous factors (clinical, laboratory, and flow cytometry) associated with the disease. B-CLL diagnosis is based primarily on the study of the antigenic phenotype of the patients’ blood cells, which is performed by flow cytometry analysis. Although the analysis method is well defined, the way laboratory experts traditionally carry it out is characterized by imprecision and subjectivity. As flow cytometry technology advances rapidly, the need for automated (computer-assisted) methods for analyzing the data it produces grows accordingly. In this context, we present a useful paradigm of automated analysis of flow cytometry data for the diagnosis of B-CLL patients that does not require direct expert supervision. The values of the characteristic flow cytometry parameters extracted with the proposed methodology are then incorporated into the prognostic system mentioned above. By reducing the B-CLL prognosis problem to an instance of pattern classification, and by modeling each step of the B-CLL diagnosis procedure as an instance of data clustering, we addressed both problems with statistical learning techniques. We focused on Bayesian network methodologies and used the naïve-Bayes model in both cases, in its supervised and unsupervised versions, respectively. The characteristics and nature of the data (especially the flow cytometry data) generated by an underlying pathological mechanism, such as that of the disease, do not favor the direct application of this model. 
Therefore, we combined the naïve-Bayes model with suitable heuristic algorithmic procedures to obtain better results, judged not only by commonly used algorithm evaluation metrics but also by the opinion of hematologists. Because they can incorporate expert knowledge as prior information for initializing their learning methods, Bayesian methodologies are considered the most appropriate for this type of problem. 616.994 190 75 B-chronic lymphocytic leukemia (B-CLL) Flow cytometry Statistical learning Belief networks Data mining
