291

A General Model for Continuous Noninvasive Pulmonary Artery Pressure Estimation

Smith, Robert Anthony 15 December 2011 (has links) (PDF)
Elevated pulmonary artery pressure (PAP) is a significant healthcare risk. Continuous monitoring for patients with elevated PAP is crucial for effective treatment, yet the most accurate method is invasive and expensive, and cannot be performed repeatedly. Noninvasive methods exist but are inaccurate, expensive, and cannot be used for continuous monitoring. We present a machine learning model based on heart sounds that estimates pulmonary artery pressure with enough accuracy to exclude an invasive diagnostic operation, allowing for consistent monitoring of heart condition in suspect patients without the cost and risk of invasive monitoring. We conduct a greedy search through 38 possible features using a 109-patient cross-validation to find the most predictive features. Our best general model has a standard estimate of error (SEE) of 8.28 mmHg, which outperforms the previous best performance in the literature on a general set of unseen patient data.
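
A minimal sketch of the kind of greedy feature search with cross-validation that the abstract describes, assuming a tabular matrix of heart-sound features and a linear regressor; the model choice, fold count, and stopping rule are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def standard_error_of_estimate(y_true, y_pred):
    # SEE: root mean squared residual of the cross-validated predictions
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def greedy_forward_selection(X, y, max_features=10, cv=10):
    """Greedily add the feature that most reduces cross-validated SEE."""
    selected, remaining = [], list(range(X.shape[1]))
    best_see = np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for j in remaining:
            cols = selected + [j]
            pred = cross_val_predict(LinearRegression(), X[:, cols], y, cv=cv)
            scores.append((standard_error_of_estimate(y, pred), j))
        see, j_best = min(scores)
        if see >= best_see:          # stop once no candidate improves SEE
            break
        best_see, selected = see, selected + [j_best]
        remaining.remove(j_best)
    return selected, best_see
```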
292

Bayesian Text Analytics for Document Collections

Walker, Daniel David 15 November 2012 (has links) (PDF)
Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation for this work is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded original documents. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata. We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we effectively: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and present analyses that empirically show how existing algorithms break down in the presence of OCR errors.
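
As an illustration of the kind of topic-model pipeline the dissertation evaluates on OCR-degraded text, the sketch below fits a plain LDA model with scikit-learn; the toy corpus, vectorizer settings, and topic count are placeholders, and the dissertation's noise-aware and metadata-aware models are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for an OCR'd historical collection; real input would be noisy page text.
docs = [
    "the parliament debated the grain tariff act",
    "tbe parliarnent debated the graiu tariff act",   # simulated OCR errors
    "steam engines and railway expansion in the north",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic proportions

# Inspect the top words per topic
terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```

Note how the simulated OCR variants ("tbe", "parliarnent", "graiu") inflate the vocabulary with spurious terms, which is exactly the failure mode the dissertation studies.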
293

Fuzzy Clustering Applied to Multi-Omics Data

SARAH HANNAH LUCIUS LACERDA DE GOES TELLES CARVALHO ALVES 05 October 2021 (has links)
Advances in technologies for obtaining multi-omic data provide different levels of molecular information that are progressively increasing in volume and variety. This study proposes a methodology for integrating clinical and multi-omic data whose aim is to identify cancer subtypes with a fuzzy clustering algorithm, thereby representing the gradations between different molecular profiles. A better characterization of tumors into molecular subtypes can contribute to a more personalized and assertive medicine. A classifier that uses a target class drawn from literature results indicates which omic data sets should be integrated. Next, the data sets are pre-processed to reduce their high dimensionality. The selected data are integrated and then clustered. The fuzzy C-means algorithm was chosen for its ability to account for patients sharing characteristics across different groups, which is not possible with classical clustering methods. As a case study, colorectal cancer (CRC) data were used. CRC has the fourth highest incidence in the world population and the third highest in Brazil. Methylation, miRNA and mRNA expression data were extracted from The Cancer Genome Atlas (TCGA) project portal. Adding miRNA expression and methylation data to a literature mRNA expression classifier increased its accuracy by 5 percentage points, so methylation, miRNA and mRNA expression data were used in this work. The attributes of each data set were pre-selected, yielding a significant reduction in the number of attributes. Groups were identified using the fuzzy C-means algorithm. Varying the algorithm's hyperparameters, the number of groups and the fuzzification parameter, allowed the best-performing combination to be chosen. This choice considered the effect of the parameter variation on biological characteristics, especially on the overall survival of patients. The resulting clustering showed that samples considered not grouped share biological characteristics with groups of different prognoses. The results obtained by combining clinical and omic data are promising for better predicting the phenotype.
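
A minimal NumPy sketch of the fuzzy C-means step described above, assuming the multi-omic features have already been integrated into a single matrix; the cluster count and fuzzifier value are illustrative, not the thesis's tuned hyperparameters.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Basic fuzzy C-means: returns memberships U (n_samples x n_clusters) and centers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(n_clusters), size=n)       # random initial fuzzy memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (dist ** (2 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)        # normalize memberships per sample
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

# Example: integrated methylation / miRNA / mRNA features stacked column-wise (placeholder data)
X = np.random.default_rng(1).normal(size=(120, 40))
U, centers = fuzzy_c_means(X, n_clusters=4, m=1.5)
```

Unlike hard clustering, each row of U gives graded memberships, so a patient with an intermediate molecular profile can be partially assigned to several subtypes.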
294

A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems

Kniberg, Anette, Nokto, David January 2018 (has links)
Feature selection is the process of automatically selecting important features from data. It is an essential part of machine learning, artificial intelligence, data mining, and modelling in general. There are many feature selection algorithms available and the appropriate choice can be difficult. The aim of this thesis was to compare feature selection algorithms in order to provide an experimental basis for which algorithm to choose. The first phase involved assessing which algorithms are most common in the scientific community, through a systematic literature study in the two largest reference databases: Scopus and Web of Science. The second phase involved constructing and implementing a benchmark pipeline to compare the performance of 31 algorithms on 50 data sets. The selected features were used to construct classification models and their predictive performances were compared, as well as the runtime of the selection process. The results show a small overall superiority of embedded type algorithms, especially types that involve Decision Trees. However, there is no algorithm that is significantly superior in every case. The pipeline and data from the experiments can be used by practitioners in determining which algorithms to apply to their respective problems.
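
A small sketch in the spirit of the benchmark described above, comparing a filter method, an embedded tree-based selector, and no selection on one dataset; the dataset, classifier, and feature counts are placeholders rather than the thesis's 31-algorithm, 50-dataset pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, random_state=0)

candidates = {
    "no selection": clf,
    "filter (mutual info, k=10)": make_pipeline(
        SelectKBest(mutual_info_classif, k=10), clf),
    "embedded (tree importances)": make_pipeline(
        SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)), clf),
}

for name, model in candidates.items():
    acc = cross_val_score(model, X, y, cv=5).mean()   # compare cross-validated accuracy
    print(f"{name:32s} mean CV accuracy = {acc:.3f}")
```

Timing each fit alongside the accuracy comparison would mirror the thesis's second evaluation axis, the runtime of the selection process.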
295

Feature Extraction and Feature Selection for Object-based Land Cover Classification : Optimisation of Support Vector Machines in a Cloud Computing Environment

Stromann, Oliver January 2018 (has links)
Mapping the Earth's surface and its rapid changes with remotely sensed data is a crucial tool to understand the impact of an increasingly urban world population on the environment. However, the impressive amount of freely available Copernicus data is only marginally exploited in common classifications. One of the reasons is that measuring the properties of training samples, the so-called 'features', is costly and tedious. Furthermore, handling large feature sets is not easy in most image classification software. This often leads to the manual choice of few, allegedly promising features. In this Master's thesis degree project, I use the computational power of Google Earth Engine and Google Cloud Platform to generate an oversized feature set in which I explore feature importance and analyse the influence of dimensionality reduction methods. I use Support Vector Machines (SVMs) for object-based classification of satellite images - a commonly used method. A large feature set is evaluated to find the most relevant features to discriminate the classes and thereby contribute most to high classification accuracy. In doing so, one can bypass the sensitive knowledge-based but sometimes arbitrary selection of input features. Two kinds of dimensionality reduction methods are investigated: the feature extraction methods, Linear Discriminant Analysis (LDA) and Independent Component Analysis (ICA), which transform the original feature space into a projected space of lower dimensionality; and the filter-based feature selection methods, chi-squared test, mutual information and Fisher-criterion, which rank and filter the features according to a chosen statistic. I compare these methods against the default SVM in terms of classification accuracy and computational performance. The classification accuracy is measured in overall accuracy, prediction stability, inter-rater agreement and the sensitivity to training set sizes. The computational performance is measured in the decrease in training and prediction times and the compression factor of the input data. I conclude on the best performing classifier with the most effective feature set based on this analysis. In a case study of mapping urban land cover in Stockholm, Sweden, based on multitemporal stacks of Sentinel-1 and Sentinel-2 imagery, I demonstrate the integration of Google Earth Engine and Google Cloud Platform for an optimised supervised land cover classification. I use dimensionality reduction methods provided in the open source scikit-learn library and show how they can improve classification accuracy and reduce the data load. At the same time, this project gives an indication of how the exploitation of big earth observation data can be approached in a cloud computing environment. The preliminary results highlighted the effectiveness and necessity of dimensionality reduction methods but also strengthened the need for inter-comparable object-based land cover classification benchmarks to fully assess the quality of the derived products. To facilitate this need and encourage further research, I plan to publish the datasets (i.e. imagery, training and test data) and provide access to the developed Google Earth Engine and Python scripts as Free and Open Source Software (FOSS).
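
A scikit-learn sketch of the kind of comparison the abstract describes, pairing an SVM with either a filter-based selector or a feature-extraction transform; the random feature matrix stands in for object-based statistics exported from Google Earth Engine, and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder for per-object features (band statistics, texture, indices) and land cover labels
rng = np.random.default_rng(0)
X = rng.random((500, 120))
y = rng.integers(0, 6, size=500)

pipelines = {
    "SVM only": make_pipeline(MinMaxScaler(), SVC()),
    "chi2 filter + SVM": make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=30), SVC()),
    "mutual info filter + SVM": make_pipeline(MinMaxScaler(), SelectKBest(mutual_info_classif, k=30), SVC()),
    "LDA extraction + SVM": make_pipeline(MinMaxScaler(), LinearDiscriminantAnalysis(n_components=5), SVC()),
}

for name, pipe in pipelines.items():
    print(name, cross_val_score(pipe, X, y, cv=5).mean())   # overall accuracy per variant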
296

Quantum Algorithms for Feature Selection and Compressed Feature Representation of Data

Laius Lundgren, William January 2023 (has links)
Quantum computing has emerged as a new field that may have the potential to revolutionize the landscape of information processing and computational power, although physically constructing quantum hardware has proven difficult, and quantum computers in the current Noisy Intermediate Scale Quantum (NISQ) era are error prone and limited in the number of qubits they contain. A sub-field within quantum algorithms research which holds potential for the NISQ era, and which has seen increasing activity in recent years, is quantum machine learning, where researchers apply approaches from classical machine learning to quantum computing algorithms and explore the interplay between the two. This master thesis investigates feature selection and autoencoding algorithms for quantum computers. Our review of the prior art led us to focus on contributing to three sub-problems: A) embedded feature selection on quantum annealers, B) short-depth quantum autoencoder circuits, and C) embedded compressed feature representation for quantum classifier circuits. For problem A, we demonstrate a working example by converting ridge regression to the Quadratic Unconstrained Binary Optimization (QUBO) problem formalism native to quantum annealers, and solving it on a simulated backend. For problem B, we develop a novel quantum convolutional autoencoder architecture and successfully run simulation experiments to study its performance. For problem C, we choose a classifier quantum circuit ansatz based on theoretical considerations from the prior art, experimentally study it in parallel with a classical benchmark method for the same classification task, and then show a method for embedding a compressed feature representation onto that quantum circuit.
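
The sketch below illustrates one way to cast ridge regression with binary coefficients as a QUBO matrix, the kind of mapping problem A refers to; the derivation, penalty value, and brute-force stand-in for an annealer are my own illustrative assumptions, not the thesis's exact formulation or backend.

```python
import numpy as np

def ridge_to_qubo(X, y, lam=1.0):
    """Cast ridge regression with binary weights z in {0,1}^d as a QUBO matrix Q,
    so that z^T Q z equals ||y - X z||^2 + lam * ||z||^2 up to a constant ||y||^2."""
    G = X.T @ X                                  # quadratic couplings
    c = -2.0 * (X.T @ y)                         # linear terms from the data-fit expansion
    Q = G.copy()
    np.fill_diagonal(Q, np.diag(G) + lam + c)    # fold linear + ridge terms into the diagonal (z_i^2 = z_i)
    return Q

def brute_force_qubo(Q):
    """Tiny exhaustive solver standing in for a quantum annealer (only feasible for small d)."""
    d = Q.shape[0]
    best_z, best_val = None, np.inf
    for bits in range(2 ** d):
        z = np.array([(bits >> i) & 1 for i in range(d)], dtype=float)
        val = z @ Q @ z
        if val < best_val:
            best_z, best_val = z, val
    return best_z, best_val

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
true_z = np.array([1, 0, 1, 0, 0, 1, 0, 0], dtype=float)
y = X @ true_z + 0.05 * rng.normal(size=50)

Q = ridge_to_qubo(X, y, lam=0.5)
z_hat, _ = brute_force_qubo(Q)
print("selected features:", np.flatnonzero(z_hat))
```

In practice the matrix Q would be handed to an annealer or a simulated QUBO solver rather than enumerated exhaustively; the binary solution doubles as a feature-selection mask, which is what makes the formulation "embedded".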
297

Biomarker Identification for Breast Cancer Types Using Feature Selection and Explainable AI Methods

La Rosa Giraud, David E 01 January 2023 (has links) (PDF)
This paper investigates the impact of the LASSO, mRMR, SHAP, and Reinforcement Feature Selection techniques on random forest models for the breast cancer subtype markers ER, HER2, PR, and TN, as well as identifying a small subset of biomarkers that could potentially cause the disease and explaining them using explainable AI techniques. This is important because in areas such as healthcare, understanding why a model makes a specific decision matters: the prediction is a diagnostic for an individual, which requires reliable AI. Another contribution is using feature selection methods to identify a small subset of biomarkers capable of predicting whether a specific RNA sequence will be positive for one of the cancer labels. The study begins by obtaining a baseline accuracy metric using a random forest model on The Cancer Genome Atlas's breast cancer database, then explores the effects of feature selection, showing that the number of features selected significantly influences model accuracy, and selects a small number of potential biomarkers that may produce a specific type of breast cancer. Once the biomarkers were selected, the explainable AI techniques SHAP and LIME were applied to the models and provided insight into influential biomarkers and their impact on predictions. The main results are that some biomarkers with high influence over the model predictions are shared between subsets; that the LASSO and Reinforcement Feature Selection sets scored the highest accuracy of all sets; and that applying the existing explainable AI methods SHAP and LIME yielded insight into how the selected features affect the models' predictions.
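
A hedged sketch of the LASSO-then-random-forest-then-SHAP workflow described above, assuming an expression matrix X with a binary subtype label such as ER status; the placeholder data, penalty strength, and model sizes are assumptions, and the example relies on the third-party shap package rather than any code from the paper.

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder expression matrix (samples x genes) and a binary subtype label, e.g. ER status
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)

# Step 1: L1-penalised (LASSO-style) logistic regression as an embedded feature selector
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])          # candidate biomarker indices

# Step 2: random forest trained on the reduced biomarker panel
X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))

# Step 3: SHAP values to explain which selected biomarkers drive the predictions
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_te)          # can be passed to shap.summary_plot for inspection
```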
298

Regularization: Stagewise Regression and Bagging

Ehrlinger, John M. 31 March 2011 (has links)
No description available.
299

The Effects of Novel Feature Vectors on Metagenomic Classification

Plis, Kevin A. 24 September 2014 (has links)
No description available.
300

Activity Recognition Using Accelerometer and Gyroscope Data From Pocket-Worn Smartphones

Söderberg, Oskar, Blommegård, Oscar January 2021 (has links)
Human Activity Recognition (HAR) is a widely researched field that has gained importance due to recent advancements in sensor technology and machine learning. In HAR, sensors are used to identify the activity that a person is performing. In this project, the six everyday life activities walking, biking, sitting, standing, ascending stairs and descending stairs are classified using smartphone accelerometer and gyroscope data collected by three subjects in their everyday life. To perform the classification, two different machine learning algorithms, Artificial Neural Network (ANN) and Support Vector Machine (SVM), are implemented and compared. Moreover, we compare the accuracy of the two sensors, both individually and combined. Our results show that the accuracy is higher using only the accelerometer data compared to using only the gyroscope data. For the accelerometer data, the accuracy is greater than 95% for both algorithms, while it is only between 83-93% using gyroscope data. Also, there is a small synergy effect when using both sensors, yielding higher accuracy than for any individual sensor data and reaching 98.5% using ANN. Furthermore, for all sensor types, the ANN outperforms the SVM algorithm, with an accuracy greater by 1.5-9 percentage points. / Bachelor's degree project in electrical engineering 2021, KTH, Stockholm
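
A rough sketch of the kind of pipeline the abstract compares, assuming raw accelerometer and gyroscope samples have already been segmented into fixed-length windows with activity labels; the window statistics, network size, and random stand-in data are illustrative placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def window_features(window):
    """Simple per-axis statistics for one window of shape (samples, 6 axes: acc xyz + gyro xyz)."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0),
                           window.min(axis=0), window.max(axis=0)])

# Placeholder windows: 600 windows of 128 samples over 6 sensor axes, 6 activity classes
rng = np.random.default_rng(0)
windows = rng.normal(size=(600, 128, 6))
labels = rng.integers(0, 6, size=600)

X = np.array([window_features(w) for w in windows])

models = {
    "ANN (MLP)": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, labels, cv=5).mean())   # compare classification accuracy
```

Restricting the feature function to the accelerometer axes or the gyroscope axes alone would reproduce the per-sensor comparison reported in the abstract.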
