
Využití pokročilých statistických metod pro zpracování obrazu fluorescenční emise rostlin ovlivněných lokálním biotickým stresem / Utilization of advanced statistical methods for processing of fluorescence emission of plants affected by local biotic stress

MATOUŠ, Karel, January 2008
Chlorophyll fluorescence imaging is a noninvasive technique often used in plant physiology, molecular biology and precision farming. Captured image sequences record the dynamics of chlorophyll fluorescence emission, which contain information about spatial and temporal changes in the photosynthetic activity of the plant. The goal of this Ph.D. thesis is to contribute to the development of chlorophyll fluorescence imaging through the application of advanced statistical techniques. Methods of statistical pattern recognition make it possible to identify the images in a captured sequence that are rich in information about the observed biotic stress, and to find small subsets of fluorescence images suitable for subsequent analysis. I focused on methods for identifying small sets of images that provide high performance at realistic computational cost.
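The idea of ranking frames of a fluorescence sequence by how well they separate stressed from healthy tissue can be illustrated with a toy sketch. This is not the thesis code: the Fisher-ratio score, the masks, and the synthetic data below are illustrative stand-ins for the statistical criteria the abstract refers to.

```python
import numpy as np

def fisher_score(frame, stressed_mask, control_mask):
    """Separability of two pixel populations in one image frame."""
    a = frame[stressed_mask]
    b = frame[control_mask]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

def select_frames(sequence, stressed_mask, control_mask, k=3):
    """Return the indices of the k most informative frames."""
    scores = [fisher_score(f, stressed_mask, control_mask) for f in sequence]
    return sorted(np.argsort(scores)[::-1][:k].tolist())

# Toy sequence: 10 frames of 8x8 "images"; a stress signal appears in one
# quadrant only during frames 4-6.
rng = np.random.default_rng(0)
sequence = rng.normal(1.0, 0.05, size=(10, 8, 8))
stressed_mask = np.zeros((8, 8), dtype=bool)
stressed_mask[:4, :4] = True
control_mask = ~stressed_mask
for t in (4, 5, 6):
    sequence[t][stressed_mask] += 0.5

print(select_frames(sequence, stressed_mask, control_mask))  # [4, 5, 6]
```

In this toy setup the three frames carrying the stress signal are recovered; the thesis works with measured fluorescence kinetics and more elaborate classifiers, but the frame-subset-selection structure is the same.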

Stochastic density ratio estimation and its application to feature selection / Estimação estocástica da razão de densidades e sua aplicação em seleção de atributos

Ígor Assis Braga, 23 October 2014
The estimation of the ratio of two probability densities is an important statistical tool in supervised machine learning. In this work, we introduce new methods of density ratio estimation based on the solution of a multidimensional integral equation involving cumulative distribution functions. The resulting methods use the novel V-matrix, a concept that does not appear in previous density ratio estimation methods. Experiments demonstrate the good potential of this new approach against previous methods. Mutual information (MI) estimation is a key component in feature selection and essentially depends on density ratio estimation. Using one of the density ratio estimation methods proposed in this work, we derive a new estimator, VMI, and compare it experimentally to previously proposed MI estimators. Experiments conducted solely on mutual information estimation show that VMI compares favorably to previous estimators. Experiments applying MI estimation to feature selection in classification tasks show that better MI estimation leads to better feature selection performance. Parameter selection greatly impacts the classification accuracy of kernel-based Support Vector Machines (SVM). However, this step is often overlooked in experimental comparisons, as it is time consuming and requires familiarity with the inner workings of SVM. In this work, we propose procedures for SVM parameter selection that are economical in running time. In addition, we propose the use of a non-linear kernel function, the min kernel, that can be applied to both low- and high-dimensional cases without adding another parameter to the selection process. The combination of the proposed parameter selection procedures and the min kernel yields a convenient way of economically extracting good classification performance from SVM. The Regularized Least Squares (RLS) regression method is another kernel method that depends on proper selection of its parameters. When training data is scarce, traditional parameter selection often leads to poor regression estimation. To mitigate this issue, we explore a kernel that is less susceptible to overfitting, the additive INK-splines kernel. We then consider alternatives to cross-validation for parameter selection that have been shown to perform well for other regression methods. Experiments conducted on real-world datasets show that the additive INK-splines kernel outperforms both the RBF kernel and the previously proposed multiplicative INK-splines kernel. They also show that the alternative parameter selection procedures fail to consistently improve performance. Still, we find that the Finite Prediction Error method with the additive INK-splines kernel performs comparably to cross-validation.
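The parameter-free kernel mentioned in the abstract can be sketched quickly, assuming "min kernel" refers to the intersection-style kernel k(x, y) = Σᵢ min(xᵢ, yᵢ) over non-negative features (an assumption on my part, not a statement of the thesis's exact definition).

```python
import numpy as np

def min_kernel(X, Y):
    """Gram matrix K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    # Broadcasting: (n, 1, d) vs (1, m, d) -> (n, m, d), then sum over d.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=-1)

rng = np.random.default_rng(1)
X = rng.random((5, 3))            # 5 samples, 3 non-negative features
K = min_kernel(X, X)

print(K.shape)                    # (5, 5)
print(bool(np.allclose(K, K.T)))  # True: the kernel is symmetric
```

Because the kernel has no bandwidth or degree parameter, model selection reduces to the regularization constant alone, which is the economy the abstract argues for. A Gram matrix like `K` can be handed to any kernel method that accepts precomputed kernels.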

Methods for automation of vascular lesions detection in computed tomography images / Méthodes d'automatisation de la détection des lésions vasculaires dans des images de tomodensitométrie

Zuluaga Valencia, Maria Alejandra, 12 January 2011
This thesis presents a framework for the detection and diagnosis of vascular lesions, with special emphasis on coronary heart disease. Coronary heart disease remains the leading cause of mortality worldwide. Typically, the problem of vascular lesion identification has been approached by trying to model the abnormalities (lesions). The main drawback of this approach is that lesions are highly heterogeneous, which makes the detection of previously unseen abnormalities difficult. We chose not to model lesions directly, but to treat them as anomalies that appear as low-probability-density points. We propose the use of two classification frameworks based on support vector machines (SVM) for the density level detection problem. The main advantage of these two methods is that the learning stage does not require labeled data representing lesions, which are always difficult to obtain. The first method is completely unsupervised, whereas the second requires only a limited number of labels for normality. The use of these anomaly detection algorithms requires features such that anomalies are represented as points with low probability density. For this purpose, we developed an intensity-based metric, called concentric rings, designed to capture the nearly symmetric intensity profiles of healthy vessels, as well as deviations from this normal behavior observed in pathological cases. Moreover, we selected a large set of alternative candidate features to use as input for the classifiers. Experiments on synthetic data and cardiac CT data demonstrated that our metric performs well in the detection of anomalies when used with the selected classifiers. Combining other features with the concentric rings metric can further improve classification performance. We defined an unsupervised feature selection scheme that determines an optimal subset of features, and compared it with existing supervised feature selection methods. These experiments showed that, in general, combining features improves classifier performance, and that the best results are achieved with the subset selected by our scheme, together with the proposed anomaly detection algorithms. Finally, we propose the use of image registration to compare the classification results at different cardiac phases. The objective here is to match the regions detected as anomalous in different time frames. In this way, beyond drawing the physician's attention to the anomaly detected as a potential lesion, we aim to help validate the diagnosis by automatically displaying the same suspected region reconstructed in different time frames.
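The core premise of the abstract, lesions as low-probability-density points, can be illustrated with a deliberately simple density estimator. The thesis uses SVM-based density level detection; the Gaussian kernel density estimate below is only a minimal sketch of the principle, with invented feature vectors.

```python
import numpy as np

def kde(train, query, h=0.5):
    """Average of Gaussian kernels centred on the training points."""
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h)).mean(axis=1)

rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(500, 2))   # "healthy vessel" features
# Flag anything whose estimated density falls below the 5th percentile
# of the densities seen on normal data.
threshold = np.quantile(kde(normal, normal), 0.05)

queries = np.array([[0.0, 0.0],   # typical point   -> not flagged
                    [6.0, 6.0]])  # far outlier     -> flagged as anomaly
flags = kde(normal, queries) < threshold
print(flags)   # [False  True]
```

Note that, as in the thesis's setting, no labeled anomalies are needed at any point: the threshold is calibrated on normal data only.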

Algorithmes de poursuite stochastiques et inégalités de concentration empiriques pour l'apprentissage statistique / Stochastic pursuit algorithms and empirical concentration inequalities for machine learning

Peel, Thomas, 29 November 2013
The first part of this thesis introduces new algorithms for the sparse encoding of signals. Based on Matching Pursuit (MP), they address the following problem: how to reduce the computation time of MP's often very costly selection step. Our answer is to subsample the dictionary at each iteration, in rows and columns. We show that this theoretically grounded approach performs well in practice. We then propose a block coordinate gradient descent algorithm for feature selection in multiclass classification. Thanks to the use of error-correcting output codes, this task can be cast as a simultaneous sparse signal encoding problem. The second part presents new empirical Bernstein-type concentration inequalities. The first ones concern the theory of U-statistics and are applied to design generalization bounds for ranking algorithms. These bounds take advantage of a variance estimator, and we propose an efficient algorithm to compute it. We then present an empirical version of the Bernstein-type inequality for martingales proposed by Freedman [1975]. Again, the strength of our result lies in a variance estimator computable from the data. This allows us to propose generalization bounds for online learning algorithms that improve on the state of the art and pave the way for a new family of learning algorithms taking advantage of this empirical information.
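The speed-up idea in the first part can be sketched in a few lines: at each Matching Pursuit iteration, run the expensive atom-selection step on a random subset of the dictionary's columns rather than the full dictionary. This is a hedged illustration, not the thesis's algorithm: it shows only the column-subsampling half (the thesis also subsamples rows), and all names and sizes are made up.

```python
import numpy as np

def subsampled_mp(D, y, n_iter=10, col_frac=0.5, rng=None):
    """Greedy sparse coding of y over dictionary D, selecting atoms
    from a random column subset at each iteration."""
    if rng is None:
        rng = np.random.default_rng()
    n_atoms = D.shape[1]
    x = np.zeros(n_atoms)
    residual = y.copy()
    for _ in range(n_iter):
        cols = rng.choice(n_atoms, size=max(1, int(col_frac * n_atoms)),
                          replace=False)
        corr = D[:, cols].T @ residual        # selection on the subset only
        best = cols[np.argmax(np.abs(corr))]
        coef = D[:, best] @ residual          # standard MP update
        x[best] += coef
        residual -= coef * D[:, best]
    return x, residual

rng = np.random.default_rng(3)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
y = 2.0 * D[:, 7] - 1.5 * D[:, 42]            # 2-sparse ground truth
x, residual = subsampled_mp(D, y, n_iter=30, rng=rng)
print(bool(np.linalg.norm(residual) < np.linalg.norm(y)))  # True
```

Each iteration computes correlations against only `col_frac` of the atoms, which is where the selection-step savings come from; the residual still shrinks because the standard MP update is applied to whichever atom wins within the subset.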

Representação de sistemas biológicos a partir de sistemas dinâmicos: controle da transcrição a partir do estrógeno. / Representation of Biological Systems from Dynamical Systems: Transcription Control from Estrogen

Marcelo Ris, 14 April 2008
This Ph.D. research presents results in three distinct areas: (i) computer science and statistics, with the development of a new solution to the feature selection problem, a well-known problem in pattern recognition; (ii) bioinformatics, with the construction of a pipeline of algorithms, including the feature selection solution, to address the problem of identifying the architecture of a gene expression network; and (iii) biology, by relating estrogen to a new biological function, based on the results obtained by applying the new computational-statistical tools to time-series microarray data. Estrogen has an important role in reproductive tissues. The growth of the mammary glands and of the endometrium during the menstrual cycle and pregnancy is estrogen dependent. The growth of tumor cells in those organs can be stimulated by the simple presence of estrogen; over 300 genes are known to be positively or negatively regulated by it. The initial motivation of this research was the construction of a method that can serve as a tool for identifying genes whose expression level is changed by an estrogen-induced response, more specifically, a method to model the inter-relationships between the various estrogen-dependent genes. We present a new pipeline of algorithms that, from time-series microarray data and an initial set of genes sharing some common characteristics, known as seed genes, outputs the architecture of a gene expression network represented by a directed graph. For each node of the network, a prediction table of the gene represented by that node, as a function of its predictor genes (genes that link to it), can be obtained. The method was applied to a time-series microarray study of a cell culture submitted to estrogen treatment, and a possible regulation network was obtained. Finding the best predictor subset of genes for a given gene can be studied as a feature selection problem, in which the search space can be represented by a Boolean lattice whose elements represent candidate subsets. An important characteristic of this problem is that each element of the lattice has an associated cost, and this cost function has a U-shaped curve along any maximal chain of the lattice. For this problem we present a new solution, the ewindex algorithm, a branch-and-bound method that uses the structure of the Boolean lattice and the U-shape of the cost function to explore a subset of the search space equivalent to the full search. Our method obtained excellent results in efficiency and cost values when compared with the most commonly used heuristics (SFFS and SFS). Based on the pipeline of algorithms and on an initial set of genes directly regulated by estrogen, we identified evidence of the involvement of estrogen in a biological process not previously related to it: cell adhesion. This result can guide studies on estrogen and cancer toward the investigation of the metastatic process, which is affected by cell-adhesion-related genes.
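The pruning idea behind the branch-and-bound search can be sketched on a toy lattice: if the cost is U-shaped along every maximal chain, then as soon as the cost rises while adding a feature, all supersets along that chain can be discarded. The cost function below is synthetic, chosen only to be U-shaped; none of this is the thesis's actual implementation.

```python
FEATURES = range(6)
IDEAL = {1, 3}                 # hypothetical best subset

def cost(subset):
    # U-shaped along chains: penalize missing ideal features (steeply)
    # and extra features (mildly).
    return 10 * len(IDEAL - subset) + len(subset - IDEAL)

def ucurve_search():
    best, best_cost, visited = None, float("inf"), 0
    stack = [(frozenset(), cost(frozenset()))]
    seen = set()
    while stack:
        subset, c = stack.pop()
        visited += 1
        if c < best_cost:
            best, best_cost = subset, c
        for f in FEATURES:
            if f in subset:
                continue
            child = subset | {f}
            if child in seen:
                continue
            seen.add(child)
            cc = cost(child)
            if cc <= c:        # U-shape: only descend while cost does not rise
                stack.append((child, cc))
    return set(best), best_cost, visited

best, best_cost, visited = ucurve_search()
print(best, best_cost)         # {1, 3} 0
print(visited < 2 ** 6)        # True: pruning skipped part of the lattice
```

Under the U-shape assumption this prunes whole sublattices without losing the optimum, which is what makes the search equivalent to a full search while visiting far fewer nodes.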

Redes complexas de expressão gênica: síntese, identificação, análise e aplicações / Gene expression complex networks: synthesis, identification, analysis and applications

Fabricio Martins Lopes, 21 February 2011
Thanks to recent advances in molecular biology and biochemistry, allied to an ever-increasing amount of experimental data, the functional state of thousands of genes can now be measured simultaneously using methods such as DNA microarrays, SAGE and, more recently, RNA-Seq, generating a massive volume of biological data. The mapping of gene transcription levels at large scale is motivated by the proposition that the functional state of an organism is broadly determined by its gene expression. However, the main limitation faced is the small number of samples (experiments) with huge dimensionality (genes). Thus, it is necessary to develop new computational and statistical techniques to reduce the inherent estimation error committed in the presence of a small number of samples of large dimensionality. In this context, particularly important related investigations are the modeling and identification of gene regulatory networks (GRNs) from expression data sets. The main objective of this research is to infer how genes are regulated, bringing knowledge about the molecular interactions and metabolic activities of an organism. Such knowledge is fundamental for many applications, such as disease treatment, therapeutic intervention strategies and drug design, as well as for planning new high-throughput experiments. In this direction, this work presents several contributions: (1) a feature selection software package; (2) a new approach for the generation of artificial gene networks (AGNs); (3) a criterion function based on Tsallis entropy; (4) alternative search strategies for GRN inference, SFFS-MR and SFFS-BA; (5) a biological investigation of the GRNs involved in thiamine biosynthesis, adopting Arabidopsis thaliana as a model plant. The feature selection software is an open-source, multi-platform graphical environment for bioinformatics problems, which supports many feature selection algorithms, criterion functions and graphical visualization tools. In particular, a feature-selection-based GRN inference method is implemented in the software. Although several methods have been proposed in the literature for the modeling and identification of GRNs, an important problem remains open: how to validate such methods and their results? This work presents a new approach for the validation of such algorithms by considering three main aspects: (a) an artificial gene network (AGN) model based on theoretical models of complex networks, used to simulate temporal expression data; (b) a computational method for GRN identification from temporal expression data; and (c) validation of the identified network through comparison with the original AGN. The development of the AGN model made it possible to analyze and investigate the characteristics of GRN inference methods, leading to a comparative study of four inference methods available in the literature. The evaluation of inference methods led to the development of new methodologies for this task: (a) a new criterion function based on Tsallis entropy, in order to infer the gene inter-relationships with better precision; (b) an alternative search strategy for GRN inference, called SFFS-MR, which tries to exploit a local property of regulatory gene interdependencies known as intrinsically multivariate prediction; and (c) a floating search strategy, SFFS-BA, based on scale-free network topology as a global property of GRNs, considered as a priori information, in order to provide a method better suited to this class of problems and thereby achieve more precise results. It is also an objective of this work to apply the developed methodology to biological data, particularly in identifying GRNs related to specific functions of Arabidopsis thaliana. The experimental results obtained from the application of the proposed methodologies show that the respective performance gains were significant and adequate for the problems addressed.
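The Tsallis-entropy criterion can be sketched for a single target gene: among candidate predictor sets, pick the one minimizing the conditional Tsallis entropy of the target given the predictors, estimated from discretized expression values. The entropic index q, the toy data, and the discretization below are illustrative choices, not the thesis's parameters.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def tsallis(p, q=2.0):
    """Tsallis entropy of a discrete distribution p for entropic index q."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def cond_tsallis(predictors, target, q=2.0):
    """H_q(target | predictors), weighted by observed context frequencies.
    predictors has shape (k, n); target has shape (n,)."""
    n = len(target)
    h = 0.0
    for c, cnt in Counter(map(tuple, predictors.T)).items():
        mask = np.all(predictors.T == np.array(c), axis=1)
        _, counts = np.unique(target[mask], return_counts=True)
        h += (cnt / n) * tsallis(counts / cnt, q)
    return h

# Toy binarized expression: 3 candidate genes over 12 time points;
# the target is the AND of genes 0 and 1, gene 2 is noise.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(3, 12))
target = X[0] & X[1]

best = min(combinations(range(3), 2),
           key=lambda s: cond_tsallis(X[list(s)], target))
print(best)   # (0, 1): the true predictor pair
```

The true predictor pair drives the conditional entropy to zero, so the criterion recovers it; in the thesis this score also embeds sample-size penalization, which is omitted here for brevity.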
437

Seleção de características e predição intrinsecamente multivariada em identificação de redes de regulação gênica / Feature selection and intrinsically multivariate prediction in gene regulatory networks identification

David Corrêa Martins Junior 01 December 2008 (has links)
Seleção de características é um tópico muito importante em aplicações de reconhecimento de padrões, especialmente em bioinformática, cujos problemas são geralmente tratados sobre um conjunto de dados envolvendo muitas variáveis e poucas observações. Este trabalho analisa aspectos de seleção de características no problema de identificação de redes de regulação gênica a partir de sinais de expressão gênica. Particularmente, propusemos um modelo de redes gênicas probabilísticas (PGN) que devolve uma rede construída a partir da aplicação recorrente de algoritmos de seleção de características orientados por uma função critério baseada em entropia condicional. Tal critério embute a estimação do erro por penalização de amostras raramente observadas. Resultados desse modelo aplicado a dados sintéticos e a conjuntos de dados de microarray de Plasmodium falciparum, um agente causador da malária, demonstram a validade dessa técnica, tendo sido capaz não apenas de reproduzir conhecimentos já produzidos anteriormente, como também de produzir novos resultados. Outro aspecto investigado nesta tese é o fenômeno da predição intrinsecamente multivariada (IMP), ou seja, o fato de um conjunto de características ser um ótimo caracterizador dos objetos em questão, mas qualquer de seus subconjuntos propriamente contidos não conseguirem representá-los de forma satisfatória. Neste trabalho, as condições para o surgimento desse fenômeno foram obtidas de forma analítica para conjuntos de 2 e 3 características em relação a uma variável alvo. No contexto de redes de regulação gênica, foram obtidas evidências de que genes alvo de conjuntos IMP possuem um enorme potencial para exercerem funções vitais em sistemas biológicos. O fenômeno conhecido como canalização é particularmente importante nesse contexto. 
Em dados de microarray de melanoma, constatamos que o gene DUSP1, conhecido por exercer função canalizadora, foi aquele que obteve o maior número de conjuntos de genes IMP, sendo que todos eles possuem lógicas de predição canalizadoras. Além disso, simulações computacionais para construção de redes com 3 ou mais genes mostram que o tamanho do território de um gene alvo pode ter um impacto positivo em seu teor de IMP com relação a seus preditores. Esta pode ser uma evidência que confirma a hipótese de que genes alvo de conjuntos IMP possuem a tendência de controlar diversas vias metabólicas cruciais para a manutenção das funções vitais de um organismo. / Feature selection is a crucial topic in pattern recognition applications, especially in bioinformatics, where problems usually involve data with a large number of variables and a small number of observations. The present work addresses feature selection aspects of the problem of identifying gene regulatory networks from expression profiles. In particular, we propose a probabilistic genetic network (PGN) model that recovers a network constructed from the recurrent application of feature selection algorithms guided by a conditional-entropy-based criterion function. This criterion embeds error estimation by penalizing rarely observed patterns. Results from this model applied to synthetic data and to real microarray data sets from Plasmodium falciparum, a malaria agent, demonstrate the validity of the technique. The method was able not only to reproduce previously established knowledge, but also to produce other potentially relevant results. The intrinsically multivariate prediction (IMP) phenomenon was also investigated: a feature set exhibits IMP when it predicts the objects under study well, yet none of its proper subsets does so satisfactorily.
In this work, the conditions for the emergence of this phenomenon were obtained analytically for sets of 2 and 3 features with respect to a target variable. In the context of gene regulatory networks, evidence was obtained that target genes of IMP sets have great potential to perform vital functions in biological systems. The phenomenon known as canalization is particularly important in this context. In melanoma microarray data, we verified that the DUSP1 gene, known to have a canalizing function, was the one appearing in the largest number of IMP gene sets, and that all these sets have canalizing predictive logics. Moreover, computational simulations generating networks with 3 or more genes show that the territory size of a target gene can contribute positively to its IMP score with respect to its predictors. This may be evidence confirming the hypothesis that target genes of IMP sets tend to control several metabolic pathways essential to the maintenance of the vital functions of an organism.
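To make the IMP notion concrete, here is a small empirical sketch (the thesis derives analytical conditions; this brute-force score and its function names are only illustrative): a feature set is intrinsically multivariate when its prediction error is much lower than that of its best proper subset, as with the XOR target below.

```python
from collections import defaultdict
from itertools import combinations

def prediction_error(features, target):
    """Empirical error of the best constant prediction per observed pattern."""
    groups = defaultdict(list)
    for pattern, y in zip(features, target):
        groups[tuple(pattern)].append(y)
    errors = sum(len(ys) - max(ys.count(v) for v in set(ys))
                 for ys in groups.values())
    return errors / len(target)

def imp_score(X, y):
    """Error of the best proper subset minus the error of the full set;
    large positive values indicate IMP behaviour."""
    full = prediction_error(X, y)
    k = len(X[0])
    best_subset = min(
        prediction_error([tuple(row[i] for i in idx) for row in X], y)
        for r in range(1, k)
        for idx in combinations(range(k), r)
    )
    return best_subset - full

# XOR target: the classic IMP pair -- jointly perfect, individually useless.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in X]
print(imp_score(X, y))  # 0.5: each gene alone errs half the time, the pair never does
```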
438

Rozpoznávání emocí v česky psaných textech / Recognition of emotions in Czech texts

Červenec, Radek January 2011 (has links)
With advances in information and communication technologies over the past few years, the amount of information stored in the form of electronic text documents has been growing rapidly. Since human abilities to effectively process and analyze large amounts of information are limited, there is an increasing demand for tools that automatically analyze these documents and exploit their emotional content. Such systems have extensive applications. The purpose of this work is to design and implement a system for identifying expressions of emotion in Czech texts. The proposed system is based mainly on machine learning methods, and therefore the design and creation of a training set are also described. The training set is then used to train a classifier based on support vector machines (SVM). To improve classification results, additional components were integrated into the system, such as a lexical database, a lemmatizer and a derived keyword dictionary. The thesis also presents the results of classifying text documents into the defined emotion classes and evaluates various approaches to categorization.
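A minimal sketch of the SVM-based core of such a text classifier follows. It is illustrative only: the thesis pipeline adds a lexical database, lemmatizer and keyword dictionary, whereas this toy version uses naive whitespace tokenization, invented sentiment-style data standing in for emotion classes, and a Pegasos-style subgradient trainer rather than whatever solver the thesis used.

```python
import random

def bow(text):
    """Bag-of-words counts after lowercasing (a real system for Czech
    would lemmatize tokens first)."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def train_linear_svm(docs, labels, epochs=100, lam=0.01, seed=0):
    """Pegasos-style subgradient descent for a linear SVM; labels are +1/-1."""
    rng = random.Random(seed)
    w, t = {}, 0
    data = list(zip(docs, labels))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            for f in w:                       # regularization shrink step
                w[f] *= (1.0 - eta * lam)
            if margin < 1.0:                  # hinge-loss subgradient step
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def predict(w, x):
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) >= 0 else -1

# Toy two-class data standing in for emotion classes.
pos = ["great happy joy", "happy wonderful joy"]
neg = ["sad awful gloom", "awful sad gloom"]
docs = [bow(d) for d in pos + neg]
labels = [1, 1, -1, -1]
w = train_linear_svm(docs, labels)
print([predict(w, d) for d in docs])  # [1, 1, -1, -1]
```

A multi-emotion classifier can be built from several such binary models in a one-vs-rest arrangement.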
439

Studium elektrofyziologických projevů srdce v experimentální kardiologii / Study of Electrophysiological Function of the Heart in Experimental Cardiology

Ronzhina, Marina January 2017 (has links)
Cardiac disorders such as myocardial ischemia, infarction, left ventricular hypertrophy and myocarditis are usually studied in experimental cardiology on the isolated heart model. However, criteria for detecting cardiac disorders are not standardized for animal models, which complicates the comparison and interpretation of results across experimental studies. The situation is particularly difficult when several pathological phenomena occur simultaneously, since their interplay complicates the recognition of their individual effects. Correct assessment of the state of the heart also requires taking into account many factors associated with data acquisition. This thesis is devoted to the quantitative evaluation of electrophysiological changes caused by global myocardial ischemia. The effect of ischemia was evaluated for physiological hearts and hearts with an enlarged left ventricle, as well as for hearts stained with the voltage-sensitive dye di-4-ANEPPS. Although both phenomena frequently appear in animal studies, neither their influence on the manifestation of ischemia in electrograms (EG) nor their impact on the accuracy of ischemia-detection algorithms had previously been described or quantified. The thesis summarizes the quantitative changes in cardiac function induced by ischemia (under normal conditions, with left ventricular hypertrophy, and after dye administration) based on the evaluation of EG and VCG parameters. It further analyzes important aspects of signal acquisition, such as the placement of recording electrodes, the way EG and VCG descriptors are computed (with or without the results of manual delineation of the records), and the identification of the moment when ischemia develops in the preparation. An integral part of the thesis is the design, implementation and validation of methods for automatic detection of ischemia in experimental recordings. The results show that obtaining repeatable and reliable results requires taking all of the above factors into account, related both to the state of the heart and to the methodology of data recording and analysis.
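For illustration only, one of the simplest electrogram descriptors that ischemia detectors of this kind build on is the ST-segment deviation from the isoelectric baseline. The sample indices, threshold and toy beats below are invented for the example and are not taken from the thesis, which evaluates a much richer set of EG and VCG parameters:

```python
def st_deviation(beat, baseline_idx, st_idx):
    """ST-segment deviation of one beat relative to the isoelectric
    (PQ) baseline sample -- a classic marker of ischemia."""
    return beat[st_idx] - beat[baseline_idx]

def detect_ischemia(beats, baseline_idx, st_idx, threshold):
    """Flag a recording as ischemic when the mean ST deviation across
    beats exceeds a (data-derived) threshold."""
    mean_dev = sum(st_deviation(b, baseline_idx, st_idx) for b in beats) / len(beats)
    return mean_dev > threshold

# Toy beats of 10 samples; sample 2 ~ PQ baseline, sample 7 ~ ST segment.
normal = [[0, 0, 0.0, 1.0, 0.2, -0.1, 0.0, 0.02, 0.0, 0] for _ in range(5)]
ischemic = [[0, 0, 0.0, 1.0, 0.2, -0.1, 0.1, 0.35, 0.2, 0] for _ in range(5)]
print(detect_ischemia(normal, 2, 7, threshold=0.1))    # False
print(detect_ischemia(ischemic, 2, 7, threshold=0.1))  # True
```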
440

Automatic Flight Maneuver Identification Using Machine Learning Methods

Bodin, Camilla January 2020 (has links)
This thesis proposes a general approach to solving the offline flight-maneuver identification problem using machine learning methods. The purpose of the study was to provide the aircraft professionals at the flight test and verification department of Saab Aeronautics with a means of automating the analysis of flight test data. The suggested approach succeeded in generating binary and multiclass classifiers that identified six flight maneuvers of different complexity from real flight test data. The binary classifiers identify one maneuver at a time from flight test data, while the multiclass classifiers identify several maneuvers simultaneously. To achieve these results, the difficulties posed by this time-series classification problem were reduced using several strategies. One strategy was to develop a maneuver extraction algorithm based on handcrafted rules. Another was to represent the time-series data by statistical measures. There was also the issue of an imbalanced dataset, in which one class far outweighed the others in number of samples; this was addressed by applying a modified oversampling method to the training set. Logistic Regression, Support Vector Machines with both linear and nonlinear kernels, and Artificial Neural Networks were explored, with the hyperparameters for each machine learning algorithm chosen during model estimation by 4-fold cross-validation and by solving an optimization problem based on key performance metrics. A feature selection algorithm was also used during model estimation to evaluate how performance changed with the number of features used. The machine learning models were then evaluated on test data consisting of 24 flight tests. The results on the test data set showed that the simplifications made were reasonable, although the maneuver extraction algorithm could sometimes fail.
Some maneuvers were easier to identify than others, and the linear machine learning models fit the more complex classes poorly. In conclusion, both binary and multiclass classifiers can be used to solve the flight-maneuver identification problem, and solving a hyperparameter optimization problem boosted the performance of the final models. Nonlinear classifiers performed best on average across all explored maneuvers.
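Two of the strategies described above, rebalancing the training set by oversampling and splitting data for 4-fold cross-validation, can be sketched as follows. This is a simplified stand-in: the thesis uses a modified oversampling method, and all names and toy data here are illustrative.

```python
import random
from collections import Counter

def oversample(X, y, seed=0):
    """Random oversampling: duplicate minority-class samples until every
    class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_X, out_y = [], []
    for label, samples in by_class.items():
        out_X.extend(samples)
        out_y.extend([label] * len(samples))
        for _ in range(target - len(samples)):
            out_X.append(rng.choice(samples))
            out_y.append(label)
    return out_X, out_y

def kfold_indices(n, k=4, seed=0):
    """Shuffled index folds for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Toy imbalanced data: 2 samples of the rare maneuver class vs 6 of the rest.
X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [0.95]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
Xb, yb = oversample(X, y)
print(Counter(yb))                        # both classes balanced at 6 samples
folds = kfold_indices(len(Xb), k=4)
print(sorted(len(f) for f in folds))      # [3, 3, 3, 3]
```

In practice the oversampling is applied only inside each training split, never to the held-out fold, so that validation scores are not inflated by duplicated samples.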
