Global ETD Search

481	Hard and fuzzy block clustering algorithms for high dimensional data / Algorithmes de block-clustering dur et flou pour les données en grande dimension
Laclau, Charlotte (14 April 2016)
Notre capacité grandissante à collecter et stocker des données a fait de l'apprentissage non supervisé un outil indispensable qui permet la découverte de structures et de modèles sous-jacents aux données, sans avoir à étiqueter les individus manuellement. Parmi les différentes approches proposées pour aborder ce type de problème, le clustering est très certainement le plus répandu. Le clustering suppose que chaque groupe, également appelé cluster, est distribué autour d'un centre défini en fonction des valeurs qu'il prend pour l'ensemble des variables. Cependant, dans certaines applications du monde réel, et notamment dans le cas de données de dimension importante, cette hypothèse peut être invalidée. Aussi, les algorithmes de co-clustering ont-ils été proposés: ils décrivent les groupes d'individus par un ou plusieurs sous-ensembles de variables au regard de leur pertinence. La structure des données finalement obtenue est composée de blocs communément appelés co-clusters. Dans les deux premiers chapitres de cette thèse, nous présentons deux approches de co-clustering permettant de différencier les variables pertinentes du bruit en fonction de leur capacité à révéler la structure latente des données, dans un cadre probabiliste d'une part et basée sur la notion de métrique, d'autre part. L'approche probabiliste utilise le principe des modèles de mélanges, et suppose que les variables non pertinentes sont distribuées selon une loi de probabilité dont les paramètres sont indépendants de la partition des données en clusters. L'approche métrique est fondée sur l'utilisation d'une distance adaptative permettant d'affecter à chaque variable un poids définissant sa contribution au co-clustering. D'un point de vue théorique, nous démontrons la convergence des algorithmes proposés en nous appuyant sur le théorème de convergence de Zangwill. Dans les deux chapitres suivants, nous considérons un cas particulier de structure en co-clustering, qui suppose que chaque sous-ensemble d'individus est décrit par un unique sous-ensemble de variables. La réorganisation de la matrice originale selon les partitions obtenues sous cette hypothèse révèle alors une structure de blocs homogènes diagonaux. Comme pour les deux contributions précédentes, nous nous plaçons dans le cadre probabiliste et métrique. L'idée principale des méthodes proposées est d'imposer deux types de contraintes : (1) nous fixons le même nombre de clusters pour les individus et les variables; (2) nous cherchons une structure de la matrice de données d'origine qui possède les valeurs maximales sur sa diagonale (par exemple pour le cas des données binaires, on cherche des blocs diagonaux majoritairement composés de valeurs 1, et de 0 à l’extérieur de la diagonale). Les approches proposées bénéficient des garanties de convergence issues des résultats des chapitres précédents. Enfin, pour chaque chapitre, nous dérivons des algorithmes permettant d'obtenir des partitions dures et floues. Nous évaluons nos contributions sur un large éventail de données simulées et liées à des applications réelles telles que le text mining, dont les données peuvent être binaires ou continues. Ces expérimentations nous permettent également de mettre en avant les avantages et les inconvénients des différentes approches proposées. Pour conclure, nous pensons que cette thèse couvre explicitement une grande majorité des scénarios possibles découlant du co-clustering flou et dur, et peut être vue comme une généralisation de certaines approches de biclustering populaires.
/ With the growing amount of data available, unsupervised learning has become an important tool for discovering underlying patterns without the need to label instances manually. Among the different approaches proposed to tackle this problem, clustering is arguably the most popular. Clustering is usually based on the assumption that each group, also called a cluster, is distributed around a center defined in terms of all features; in some real-world applications dealing with high-dimensional data, however, this assumption may not hold. To address this, co-clustering algorithms were proposed to describe clusters by the subsets of features that are most relevant to them. The resulting latent structure of the data is composed of blocks usually called co-clusters. In the first two chapters, we describe two co-clustering methods that differentiate the relevance of features according to their capability of revealing the latent structure of the data, in both a probabilistic and a distance-based framework. The probabilistic approach uses the mixture model framework, where the irrelevant features are assumed to follow a different probability distribution that is independent of the co-clustering structure. The distance-based (also called metric-based) approach relies on an adaptive metric in which each variable is assigned a weight that defines its contribution to the resulting co-clustering. From the theoretical point of view, we show the global convergence of the proposed algorithms using Zangwill's convergence theorem. In the last two chapters, we consider a special case of co-clustering where, contrary to the original setting, each subset of instances is described by a unique subset of features, resulting in a diagonal structure of the initial data matrix. As with the first two contributions, we consider both probabilistic and metric-based approaches. The main idea of the proposed contributions is to impose two different kinds of constraints: (1) we fix the number of row clusters to the number of column clusters; (2) we seek a structure of the original data matrix that has the maximum values on its diagonal (for instance, for binary data, we look for diagonal blocks composed of ones with zeros outside the main diagonal). The proposed approaches enjoy the convergence guarantees derived from the results of the previous chapters. Finally, we present both hard and fuzzy versions of the proposed algorithms. We evaluate our contributions on a wide variety of synthetic and real-world benchmark binary and continuous data sets related to text mining applications and analyze the advantages and drawbacks of each approach. To conclude, we believe that this thesis explicitly covers a vast majority of the possible scenarios arising in hard and fuzzy co-clustering and can be seen as a generalization of some popular biclustering approaches.
Classification Flou Classification croisée Modèle de mélange Approche métrique Modèle à bloc latent Données sparses Données binaires Classification de document Théorème de Zangwill Sélection de variable Données en grande dimension Algorithme Clustering Fuzzy Co-clustering Mixture model Metric approach Latent block model Sparse data Binary data Document clustering Zangwill theorem Feature selection High dimensional data Algorithm 004
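As a rough illustration of constraint (2) in entry 481, the sketch below scores a pair of row and column partitions of a binary matrix by counting the entries that agree with an ideal block-diagonal pattern (ones inside diagonal blocks, zeros outside). The function name `diagonal_score` and the toy matrix are assumptions made for this example; it is not the algorithm of the thesis, which optimizes the partitions rather than merely scoring them.

```python
def diagonal_score(matrix, row_labels, col_labels):
    """Count entries that agree with an ideal block-diagonal pattern.

    An entry agrees if it is 1 inside a diagonal block (row cluster equals
    column cluster) or 0 outside of every diagonal block.
    """
    score = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            on_diagonal = row_labels[i] == col_labels[j]
            if (value == 1) == on_diagonal:
                score += 1
    return score


# Hypothetical binary data: 4 documents (rows) by 4 terms (columns).
X = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Same number of row and column clusters, as in constraint (1).
good = diagonal_score(X, [0, 0, 1, 1], [0, 0, 1, 1])
bad = diagonal_score(X, [0, 1, 0, 1], [0, 1, 0, 1])
```

On this toy matrix the partition that groups the first two rows and columns together scores higher than the interleaved one, which is the behaviour a diagonal co-clustering criterion is meant to reward.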
482	Apprentissage basé sur le Qini pour la prédiction de l’effet causal conditionnel
Belbahri, Mouloud-Beallah (08 1900)
Les modèles uplift (levier en français) traitent de l'inférence de cause à effet pour un facteur spécifique, comme une intervention de marketing. En pratique, ces modèles sont construits sur des données individuelles issues d'expériences randomisées. Un groupe traitement comprend des individus qui font l'objet d'une action; un groupe témoin sert de comparaison. La modélisation uplift est utilisée pour ordonner les individus par rapport à la valeur d'un effet causal, par exemple, positif, neutre ou négatif. Dans un premier temps, nous proposons une nouvelle façon d'effectuer la sélection de modèles pour la régression uplift. Notre méthodologie est basée sur la maximisation du coefficient Qini. Étant donné que la sélection du modèle correspond à la sélection des variables, la tâche est difficile si elle est effectuée de manière directe lorsque le nombre de variables à prendre en compte est grand. Pour rechercher de manière réaliste un bon modèle, nous avons conçu une méthode de recherche basée sur une exploration efficace de l'espace des coefficients de régression combinée à une pénalisation de type lasso de la log-vraisemblance. Il n'y a pas d'expression analytique explicite pour la surface Qini, donc la dévoiler n'est pas facile. Notre idée est de découvrir progressivement la surface Qini comparable à l'optimisation sans dérivée. Le but est de trouver un maximum local raisonnable du Qini en explorant la surface près des valeurs optimales des coefficients pénalisés. Nous partageons ouvertement nos codes à travers la librairie R tools4uplift. Bien qu'il existe des méthodes de calcul disponibles pour la modélisation uplift, la plupart d'entre elles excluent les modèles de régression statistique. Notre librairie entend combler cette lacune.
Cette librairie comprend des outils pour: i) la discrétisation, ii) la visualisation, iii) la sélection de variables, iv) l'estimation des paramètres et v) la validation du modèle. Cette librairie permet aux praticiens d'utiliser nos méthodes avec aise et de se référer aux articles méthodologiques afin de lire les détails. L'uplift est un cas particulier d'inférence causale. L'inférence causale essaie de répondre à des questions telle que « Quel serait le résultat si nous donnions à ce patient un traitement A au lieu du traitement B? ». La réponse à cette question est ensuite utilisée comme prédiction pour un nouveau patient. Dans la deuxième partie de la thèse, c’est sur la prédiction que nous avons davantage insisté. La plupart des approches existantes sont des adaptations de forêts aléatoires pour le cas de l'uplift. Plusieurs critères de segmentation ont été proposés dans la littérature, tous reposant sur la maximisation de l'hétérogénéité. Cependant, dans la pratique, ces approches sont sujettes au sur-ajustement. Nous apportons une nouvelle vision pour améliorer la prédiction de l'uplift. Nous proposons une nouvelle fonction de perte définie en tirant parti d'un lien avec l'interprétation bayésienne du risque relatif. Notre solution est développée pour une architecture de réseau de neurones jumeaux spécifique permettant d'optimiser conjointement les probabilités marginales de succès pour les individus traités et non-traités. Nous montrons que ce modèle est une généralisation du modèle d'interaction logistique de l'uplift. Nous modifions également l'algorithme de descente de gradient stochastique pour permettre des solutions parcimonieuses structurées. Cela aide dans une large mesure à ajuster nos modèles uplift. Nous partageons ouvertement nos codes Python pour les praticiens désireux d'utiliser nos algorithmes. 
Nous avons eu la rare opportunité de collaborer avec l'industrie afin d'avoir accès à des données provenant de campagnes de marketing à grande échelle favorables à l'application de nos méthodes. Nous montrons empiriquement que nos méthodes sont compétitives avec l'état de l'art sur les données réelles ainsi qu'à travers plusieurs scénarios de simulations.
/ Uplift models deal with cause-and-effect inference for a specific factor, such as a marketing intervention. In practice, these models are built on individual data from randomized experiments. A treatment group contains individuals who are subject to an action; a control group serves for comparison. Uplift modeling is used to order the individuals with respect to the value of a causal effect, e.g., positive, neutral, or negative. First, we propose a new way to perform model selection in uplift regression models. Our methodology is based on the maximization of the Qini coefficient. Because model selection corresponds to variable selection, the task is daunting and intractable if done in a straightforward manner when the number of variables to consider is large. To realistically search for a good model, we conceived a search method based on an efficient exploration of the regression coefficient space combined with a lasso penalization of the log-likelihood. There is no explicit analytical expression for the Qini surface, so unveiling it is not easy. Our idea is to gradually uncover the Qini surface in a manner inspired by response surface designs. The goal is to find a reasonable local maximum of the Qini by exploring the surface near optimal values of the penalized coefficients. We openly share our code through the R package tools4uplift. Though there are some computational methods available for uplift modeling, most of them exclude statistical regression models. Our package intends to fill this gap. This package comprises tools for: i) quantization, ii) visualization, iii) variable selection, iv) parameter estimation and v) model validation. The package allows practitioners to use our methods with ease and to refer to the methodological papers for the details. Uplift is a particular case of causal inference. Causal inference tries to answer questions such as "What would be the result if we gave this patient treatment A instead of treatment B?". The answer to this question is then used as a prediction for a new patient. In the second part of the thesis, we place more emphasis on prediction. Most existing approaches are adaptations of random forests for the uplift case. Several split criteria have been proposed in the literature, all relying on maximizing heterogeneity. However, in practice, these approaches are prone to overfitting. In this work, we bring a new vision to uplift modeling. We propose a new loss function defined by leveraging a connection with the Bayesian interpretation of the relative risk. Our solution is developed for a specific twin neural network architecture that allows joint optimization of the marginal probabilities of success for treated and control individuals. We show that this model is a generalization of the uplift logistic interaction model. We modify the stochastic gradient descent algorithm to allow for structured sparse solutions. This helps considerably in fitting our uplift models. We openly share our Python code for practitioners wishing to use our algorithms. We had the rare opportunity to collaborate with industry to get access to data from large-scale marketing campaigns favorable to the application of our methods. We show empirically that our methods are competitive with the state of the art on real data and across several simulation scenarios.
Descente de gradient Discrétisation Fonction de perte Inférence causale Optimisation sans dérivée Régression logistique Régularisation Réseau de neurones artificiels Sélection de variables Gradient descent Quantization Heterogeneous treatment effects Loss function Causal inference Derivative-free optimization Logistic regression Regularization Artificial neural network Feature selection Effets hétérogènes du traitement
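To make the Qini coefficient mentioned in entry 482 more concrete, here is a minimal sketch of one common formulation: individuals are ranked by predicted uplift, the cumulative incremental gain is tracked as more of them are targeted, and the coefficient is the area between that curve and the straight line to its end point. The function names, the toy data, and the exact normalization are assumptions; the thesis and the tools4uplift package may define the quantity differently.

```python
def qini_curve(scores, treated, outcome):
    """Cumulative incremental gains after targeting the top-k individuals."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_t = n_c = y_t = y_c = 0
    gains = [0.0]
    for i in order:
        if treated[i]:
            n_t += 1
            y_t += outcome[i]
        else:
            n_c += 1
            y_c += outcome[i]
        # Rescale control successes to the size of the treated group so far.
        scaled_control = y_c * n_t / n_c if n_c else 0.0
        gains.append(y_t - scaled_control)
    return gains


def qini_coefficient(gains):
    """Area between the curve and the line to its end point (trapezoid rule)."""
    n = len(gains) - 1
    area_model = sum((gains[k] + gains[k + 1]) / 2 for k in range(n)) / n
    area_random = gains[-1] / 2
    return area_model - area_random


# Hypothetical randomized experiment with 8 individuals.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]   # predicted uplift
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcome = [1, 0, 1, 0, 0, 1, 0, 1]

gains = qini_curve(scores, treated, outcome)
qini = qini_coefficient(gains)
reversed_qini = qini_coefficient(qini_curve([-s for s in scores], treated, outcome))
```

In this toy data the ranking places responsive treated individuals first, so the coefficient is positive; reversing the ranking makes it negative, which is why the coefficient can serve as a model selection criterion.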
483	Feature selection in short-term load forecasting / Val av attribut vid kortvarig lastprognos för energiförbrukning
Söderberg, Max Joel; Meurling, Axel (January 2019)
This paper investigates the correlation between energy consumption 24 hours ahead and the features used for predicting energy consumption (in this report, the words "attribute" and "feature" are used interchangeably). The features originate from three categories: weather, time and previous energy. The correlations are calculated using Pearson correlation and mutual information. The most highly correlated features were those representing previous energy consumption, followed by temperature and month. Two identical feature sets containing all attributes were obtained by ranking the features according to correlation. Three feature sets were created manually. The first set contained seven attributes representing previous energy consumption over the course of the seven days prior to the day of prediction. The second set consisted of weather and time attributes. The third set consisted of all attributes from the first and second sets. These sets were then compared on different machine learning models. It was found that the set containing all attributes and the set containing previous energy attributes yielded the best performance for each machine learning model.
/ I denna rapport undersöks korrelation och betydelsen av olika attribut för att förutspå energiförbrukning 24 timmar framåt. Attributen härstammar från tre kategorier: väder, tid och tidigare energiförbrukning. Korrelationerna tas fram genom att utföra Pearson Correlation och Mutual Information. Detta resulterade i att de högst korrelerade attributen var de som representerar tidigare energiförbrukning, följt av temperatur och månad. Två identiska attributmängder erhölls genom att ranka attributen över korrelation. Tre attributmängder skapades manuellt. Den första mängden innehöll sju attribut som representerade tidigare energiförbrukning, en för varje dag, sju dagar innan datumet för prognosen av energiförbrukning. Den andra mängden bestod av väder- och tidsattribut. Den tredje mängden bestod av alla attribut från den första och andra mängden. Dessa mängder jämfördes sedan med hjälp av olika maskininlärningsmodeller. Resultaten visade att mängden med alla attribut och den med tidigare energiförbrukning gav bäst resultat för samtliga modeller.
Short-term load forecasting energy consumption forecasting Linear regression SVR Random Forest machine learning regression feature selection attribute selection Pearson correlation Mutual information correlation matrix Two-way ANOVA Tukey’s HSD test. Kortsiktig lastprognos Energiförbrukningsprognos Linjär regression SVR Random forest Maskininlärning Attributval Pearson-korrelation Ömsesidig information Korrelationsmatris Tvåvägs ANOVA Tukey’s HSD-test. Computer and Information Sciences Data- och informationsvetenskap
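The correlation-based ranking described in entry 483 can be sketched as follows: compute the Pearson correlation between each candidate feature and the target load, then sort features by absolute correlation. The feature names and values below are invented for illustration and are not taken from the report; mutual information, the second measure used there, is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_features(features, target):
    """Return feature names sorted by absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Hypothetical data: load 24 hours ahead and three candidate features.
load_ahead = [10, 12, 14, 13, 15]
features = {
    "month": [1, 1, 1, 2, 2],
    "temperature": [5, 3, 1, 2, 2],
    "previous_load": [9, 11, 13, 12, 14],
}
ranking = rank_features(features, load_ahead)
```

Sorting by the absolute value keeps strongly negative correlations, such as temperature against load in this toy data, near the top of the ranking.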
484	Data-Driven Success in Infrastructure Megaprojects: Leveraging Machine Learning and Expert Insights for Enhanced Prediction and Efficiency / Datadriven framgång inom infrastrukturmegaprojekt: Utnyttja maskininlärning och expertkunskap för förbättrad prognostisering och effektivitet
Nordmark, David E.G. (January 2023)
This Master's thesis uses random forest and leave-one-out cross-validation to predict the success of infrastructure megaprojects. The goal was to enhance the efficiency of the design and engineering phase in the infrastructure and construction industries. Because of the small sample size of megaprojects and limited data sharing, the lack of data poses significant challenges for implementing artificial intelligence for the evaluation and prediction of megaprojects. This thesis explores how megaprojects can benefit from data collection and machine learning despite small sample sizes. The focus of the research was on analyzing data from thirteen megaprojects and identifying the most influential data for machine learning analysis. The results show that the incorporation of expert data, representing critical success factors for megaprojects, significantly enhanced the accuracy of the predictive model. The superior performance of expert data over economic data, experience data, and documentation data demonstrates the significance of domain expertise. In addition, feature selection techniques and feature importance scores demonstrate the significance of the planning phase. In the planning phase, a small, devoted, and highly experienced team of project planners has proven to be a crucial factor for project success. The thesis concludes that in order for companies to maximize the utility of machine learning, they must identify their critical success factors and collect the corresponding data.
/ Denna magisteruppsats undersöker följande forskningsfråga: Hur kan maskininlärning och insiktsfull dataanalys användas för att öka effektiviteten i infrastruktursektorns planerings- och designfas? Denna utmaning löses genom att analysera data från verkliga megaprojekt och tillämpa avancerade maskininlärningsalgoritmer för att förutspå projektframgång och ta reda på framgångsfaktorerna. Vår forskning är särskilt intresserad av megaprojekt på grund av deras komplicerade natur, unika egenskaper och enorma inverkan på samhället. Dessa projekt slutförs sällan, vilket gör att det är svårt att få tillgång till stora mängder verklig data. Det är uppenbart att AI har potential att vara ett ovärderligt verktyg för att förstå och hantera megaprojekts komplexitet, trots de problem vi står inför. Artificiell intelligens gör det möjligt att fatta beslut som är datadrivna och mer informerade. Uppsatsen lyckas med att hantera det stora problemet som är bristen på data från megaprojekt. Uppsatsen motiveras även av denna brist på data, vilket gör forskningen relevant för andra områden som präglas av litet dataurval. Resultaten från uppsatsen visar att evalueringen av megaprojekt går att förbättra genom smart användning av specifika dataattribut. Uppsatsen inspirerar även företag att börja samla in viktig data för att möjliggöra användningen av artificiell intelligens och maskininlärning till sin fördel.
Megaproject Small sample size Project management Random forest Critical success factors Feature selection Recursive feature elimination Megaprojekt Små dataurval Projektledning Random forest Kritiska framgångsfaktorer Variabel urval Rekursiv variabel eliminering Computer Sciences Datavetenskap (datalogi) Computer Engineering Datorteknik Computer and Information Sciences Data- och informationsvetenskap
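Leave-one-out cross-validation, the evaluation scheme named in entry 484, suits very small samples because every project serves as the test case exactly once. The sketch below shows the procedure in plain Python. Note that it swaps in a 1-nearest-neighbour classifier as a stand-in, since a random forest is not available in the standard library; the data, labels, and function names are hypothetical.

```python
def loo_accuracy(samples, labels, fit_predict):
    """Leave-one-out cross-validation: hold out each sample exactly once."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier (1-NN), used here instead of a random forest."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    best = min(range(len(train_x)), key=lambda k: squared_distance(train_x[k], query))
    return train_y[best]


# Hypothetical projects described by two expert-rated factors each.
projects = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
success = [0, 0, 0, 1, 1, 1]
accuracy = loo_accuracy(projects, success, nearest_neighbour)
```

Any function with the same `fit_predict(train_x, train_y, query)` signature can be passed in, so a random forest from a third-party library could replace the stand-in without changing the loop.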
485	Tuning of machine learning algorithms for automatic bug assignment
Artchounin, Daniel (January 2017)
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms intended to solve the automatic bug assignment problem. These four steps are respectively used to select a combination of pre-processing techniques, a bug report representation, a potential feature selection technique and to tune several classifiers. The method was applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT and 30 358 bug reports of Mozilla Firefox. 619 configurations were applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
bug triage bug assignment bug mining bug report activity-based approach issue tracking bug repository bug tracker pre-processing feature extraction feature selection tuning model selection hyper-parameter optimization text mining text classification classifier supervised learning machine learning information retrieval bugzilla eclipse jdt mozilla firefox open source software proprietary project accuracy mean reciprocal rank software development software maintenance software engineering Computer and Information Sciences Data- och informationsvetenskap
486	Analysis and Reconstruction of the Hematopoietic Stem Cell Differentiation Tree: A Linear Programming Approach for Gene Selection
Ghadie, Mohamed A. (January 2015)
Stem cells differentiate through an organized hierarchy of intermediate cell types to terminally differentiated cell types. This process is largely guided by master transcriptional regulators, but it also depends on the expression of many other types of genes. The discrete cell types in the differentiation hierarchy are often identified based on the expression or non-expression of certain marker genes. Historically, these have often been various cell-surface proteins, which are fairly easy to assay biochemically but are not necessarily causative of the cell type, in the sense of being master transcriptional regulators. This raises important questions about how gene expression across the whole genome controls or reflects cell state, and in particular, differentiation hierarchies. Traditional approaches to understanding gene expression patterns across multiple conditions, such as principal components analysis or K-means clustering, can group cell types based on gene expression, but they do so without knowledge of the differentiation hierarchy. Hierarchical clustering and maximization of parsimony can organize the cell types into a tree, but in general this tree is different from the differentiation hierarchy. Using hematopoietic differentiation as an example, we demonstrate how many genes other than marker genes are able to discriminate between different branches of the differentiation tree by proposing two models for detecting genes that are up-regulated or down-regulated in distinct lineages. We then propose a novel approach to solving the following problem: Given the differentiation hierarchy and gene expression data at each node, construct a weighted Euclidean distance metric such that the minimum spanning tree with respect to that metric is precisely the given differentiation hierarchy.
We provide a set of linear constraints that are provably sufficient for the desired construction and a linear programming framework to identify sparse sets of weights, effectively identifying genes that are most relevant for discriminating different parts of the tree. We apply our method to microarray gene expression data describing 38 cell types in the hematopoiesis hierarchy, constructing a sparse weighted Euclidean metric that uses just 175 genes. These 175 genes are different than the marker genes that were used to identify the 38 cell types, hence offering a novel alternative way of discriminating different branches of the tree. A DAVID functional annotation analysis shows that the 175 genes reflect major processes and pathways active in different parts of the tree. However, we find that there are many alternative sets of weights that satisfy the linear constraints. Thus, in the style of random-forest training, we also construct metrics based on random subsets of the genes and compare them to the metric of 175 genes. Our results show that the 175 genes frequently appear in the random metrics, implicating their significance from an empirical point of view as well. Finally, we show how our linear programming method is able to identify columns that were selected to build minimum spanning trees on the nodes of random variable-size matrices. Linear Programming Distance Metric Learning Machine Learning Feature Selection Tree Reconstruction Hierarchical Clustering Minimum Spanning Tree Clustering Optimization Maximum Parsimony Euclidean Distance Weighted Euclidean Stem Cell Differentiation Hematopoiesis Transcriptional Regulation Transcription Factor Gene Selection Gene Expression Microarray Cell Type Marker Gene Functional Annotation Random Forest Biological Function Regulation Statistical Significance Erythropoiesis Natural Killer Cell T Cell B Cell Granulocyte Monocyte Megakaryocyte Minimization Linear Constraint Cell Lineage
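The problem stated in entry 486 can be illustrated on a tiny example: under a weighted Euclidean metric, the minimum spanning tree over the cell types should coincide with the known hierarchy. The sketch below only checks this property for hand-picked weights using Prim's algorithm; it does not solve the linear program that the thesis uses to find sparse weights. The expression values, node numbering, and function names are assumptions.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two expression profiles."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def minimum_spanning_tree(points, weights):
    """Prim's algorithm on the complete graph; returns a set of undirected edges."""
    in_tree = {0}
    edges = set()
    while len(in_tree) < len(points):
        u, v = min(
            ((i, j) for i in in_tree for j in range(len(points)) if j not in in_tree),
            key=lambda e: weighted_distance(points[e[0]], points[e[1]], weights),
        )
        in_tree.add(v)
        edges.add(frozenset((u, v)))
    return edges


# Hypothetical profiles over two genes for a stem cell (0) and two lineages (1, 2).
profiles = [[0.0, 0.0], [1.0, 5.0], [-1.0, 5.5]]
hierarchy = {frozenset((0, 1)), frozenset((0, 2))}   # stem cell is the parent of both

uniform_tree = minimum_spanning_tree(profiles, [1.0, 1.0])
sparse_tree = minimum_spanning_tree(profiles, [1.0, 0.0])  # second gene weighted out
```

With uniform weights the second gene makes the two lineages look close to each other, so the tree links them directly; setting its weight to zero recovers the given hierarchy, which is the kind of sparse weighting the linear constraints are meant to identify.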
487	Vývoj moderních akustických parametrů kvantifikujících hypokinetickou dysartrii / Development of modern acoustic features quantifying hypokinetic dysarthria
Kowolowski, Alexander (January 2019)
This work deals with the design and testing of new acoustic features for the analysis of dysprosodic speech occurring in patients with hypokinetic dysarthria. A total of 41 new features for dysprosody quantification (describing melody, loudness, rhythm and pace) are presented and tested. The new features can be divided into 7 groups. Within each group, features differ in the statistical values used. The first four groups are based on absolute differences and cumulative sums of the fundamental frequency and short-time energy of the signal. The fifth group contains features based on multiples of the fundamental frequency and short-time energy combined into one global intonation feature. The sixth group contains global time features, formed as ratios of conventional rhythm and pace features. The last group contains global features quantifying dysprosody as a whole, formed as ratios of global intonation and global time features. All features were tested on the Czech Parkinsonian speech database PARCZ. First, kernel density estimates were computed and plotted for all features. Then a correlation analysis with clinical metadata was performed, first for all the features, then for the global features only. Next, classification and regression analyses were performed using the classification and regression trees (CART) algorithm. This analysis was first made for each feature separately, then for all the data at once, and finally a sequential floating feature selection was run to find the best-fitting combination of features for the task at hand.
Although no feature emerged as universally best, a few features repeatedly ranked among the best. There was also a trend toward a large gap between the best and the second-best feature, marking the best one as much better suited to the given task than the rest of the tested features. Results are included in the conclusion together with the discussion.
488	Porovnání klasifikačních metod / Comparison of Classification Methods
Dočekal, Martin (January 2019)
This thesis deals with a comparison of classification methods. First, classification methods based on machine learning are described; then a classifier comparison system is designed and implemented. The thesis also describes some classification tasks and datasets on which the designed system is tested. The classification tasks are evaluated according to standard metrics. The thesis also presents the design and implementation of a classifier based on the principle of evolutionary algorithms.
489	Atrial Fibrillation Detection Algorithm Evaluation and Implementation in Java / Utvärdering av algoritmer för detektion av förmaksflimmer samt implementation i Java
Dizon, Lucas; Johansson, Martin (January 2014)
Atrial fibrillation is a common heart arrhythmia characterized by a missing or irregular contraction of the atria. The disease is a risk factor for other, more serious diseases, and the total medical costs to society are extensive. It would therefore be beneficial to improve and optimize the prevention and detection of the disease. Pulse palpation and heart auscultation can facilitate the clinical detection of atrial fibrillation, but the diagnosis is generally confirmed by an ECG examination. Today there are several algorithms that detect atrial fibrillation by analysing an ECG. A common method is to study the heart rate variability (HRV) and, through different types of statistical calculations, find episodes of atrial fibrillation that deviate from normal sinus rhythm. Two algorithms for detection of atrial fibrillation have been evaluated in Matlab. One is based on the coefficient of variation (CV) and the other uses a logistic regression model. Training and testing of the algorithms were done with data from the Physionet MIT database. Several signal processing steps were used to remove different types of noise and artefacts before the data could be used. When testing the algorithms, the CV algorithm achieved a sensitivity of 91.38%, a specificity of 93.93% and an accuracy of 92.92%, and the logistic regression algorithm achieved a sensitivity of 97.23%, a specificity of 93.79% and an accuracy of 95.39%. The logistic regression algorithm performed better and was chosen for implementation in Java, where it achieved a sensitivity of 97.31%, a specificity of 93.47% and an accuracy of 95.25%.
/ Förmaksflimmer är en vanlig hjärtrytmrubbning som kännetecknas av en avsaknad eller oregelbunden kontraktion av förmaken.
Sjukdomen är en riskfaktor för andra allvarligare sjukdomar och de totala kostnaderna för samhället är betydande. Det skulle därför vara fördelaktigt att effektivisera och förbättra prevention samt diagnostisering av förmaksflimmer. Kliniskt diagnostiseras förmaksflimmer med hjälp av till exempel pulspalpation och auskultation av hjärtat, men diagnosen brukar fastställas med en EKG-undersökning. Det finns idag flertalet algoritmer för att detektera arytmin genom att analysera ett EKG. En av de vanligaste metoderna är att undersöka variabiliteten av hjärtrytmen (HRV) och utföra olika sorters statistiska beräkningar som kan upptäcka episoder av förmaksflimmer som avviker från en normal sinusrytm. I detta projekt har två metoder för att detektera förmaksflimmer utvärderats i Matlab, en baseras på beräkningar av variationskoefficienten och den andra använder sig av logistisk regression. EKG som kommer från databasen Physionet MIT används för att träna och testa modeller av algoritmerna. Innan EKG-signalen kan användas måste den behandlas för att ta bort olika typer av brus och artefakter. Vid test av algoritmen med variationskoefficienten blev resultatet en sensitivitet på 91,38%, en specificitet på 93,93% och en noggrannhet på 92,92%. För logistisk regression blev sensitiviteten 97,23%, specificiteten 93,79% och noggrannheten 95,39%. Algoritmen med logistisk regression presterade bättre och valdes därför för att implementeras i Java, där uppnåddes en sensitivitet på 91,31%, en specificitet på 93,47% och en noggrannhet på 95,25%. 
Atrial Fibrillation AF Detection algorithm AF detection Algorithm evaluation Matlab Java Electrocardiogram ECG Heart rate variability HRV Signal processing Pre processing Noise reduction Baseline wander powerline interference wavelet transform DWT R-peak detection Pan-tompkins HRV cleaning Feature selection U-test SDANN SDNN CV RMSSD PNN50 TINN LF HF Logistic regression CV algorithm Receiver operating characteristics ROC Classification Confusion matrix Leave-one-out cross validation Statgraphics Förmaksflimmer detekteringsalgoritm evaluering Matlab Java elektrokardiogram EKG Hjärtfrekvens variabilitet Signalbehandling Brusreducering Logistisk regression Variationskoefficient Klassificering Medical Engineering Medicinteknik
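The coefficient-of-variation approach and the reported evaluation metrics in entry 489 can be sketched as below: a window of RR intervals is flagged when its coefficient of variation exceeds a threshold, and predictions are scored with sensitivity, specificity and accuracy. The threshold value, the RR intervals, and the function names are assumptions made for illustration; the thesis's actual features, preprocessing, and decision rule may differ.

```python
import statistics


def coefficient_of_variation(rr_intervals):
    """Standard deviation of the RR intervals divided by their mean."""
    return statistics.pstdev(rr_intervals) / statistics.mean(rr_intervals)


def flag_af(rr_intervals, threshold=0.1):
    """Flag a window as possible atrial fibrillation (threshold is hypothetical)."""
    return coefficient_of_variation(rr_intervals) > threshold


def evaluate(predictions, truth):
    """Sensitivity, specificity and accuracy from boolean predictions and labels."""
    tp = sum(p and t for p, t in zip(predictions, truth))
    tn = sum((not p) and (not t) for p, t in zip(predictions, truth))
    fp = sum(p and (not t) for p, t in zip(predictions, truth))
    fn = sum((not p) and t for p, t in zip(predictions, truth))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(truth),
    }


# Hypothetical RR intervals in seconds.
regular = [0.80, 0.82, 0.81, 0.79, 0.80]     # steady sinus-like rhythm
irregular = [0.60, 0.95, 0.70, 1.10, 0.55]   # highly variable rhythm

metrics = evaluate([True, False, True, False], [True, False, False, False])
```

Sensitivity measures how many true episodes are caught and specificity how many normal windows are left alone, which is why the abstract reports both alongside overall accuracy.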
