Global ETD Search

41	Propagação de secas na bacia do Rio Paraná: do evento climático ao impacto hidrológico / Drougth propagation in the Paraná river basin: from the climatic event to the hydrologic impact Melo, Davi de Carvalho Diniz 26 April 2017 (has links) Desastres naturais (secas, enchentes, etc) têm resultado em perdas humanas e grandes prejuízos financeiros em diversos lugares do mundo. Os recentes períodos de seca ocorridos na região sudeste do Brasil mostraram a importância de se dispor de estratégias de mitigação dos efeitos decorrentes desses eventos extremos. Um pré-requisito para prever impactos desses eventos no futuro, é compreender como os mesmos ocorreram no passado, caracterizando-os espacial e temporalmente. Diante do exposto, o objetivo deste trabalho é quantificar os impactos regionais no sistema hidrológico causados por eventos extremos e identificar conexões entre as secas meteorológicas e hidrológicas, usando a bacia do rio Paraná como estudo de caso. Para tanto, foram identificados e caracterizados os principais eventos de seca ocorridos entre 1995 e 2015, analisaram-se as perdas de água nos componentes do balanço hídrico e no armazenamento total de água. Foram utilizados dados de sensoriamento remoto, incluindo medições da missão GRACE de anomalias no armazenamento total de água terrestre (TWSA), e estimativas de precipitação e evapotranspiração pelos satélites TRMM e MODIS, respectivamente. Simulações de modelos globais de assimilação de dados de superfície terrestre forneceram estimativas de escoamento superficial e umidade do solo. Foram coletados dados de 37 reservatórios para quantificar as perdas de água no armazenamento em terra. Os resultados mostram que o TWSA diminuiu 150 ± 50 km3 entre 2011 e 2015 na bacia do rio Paraná, o armazenamento dos reservatórios diminuiu 30% em relação à capacidade máxima do sistema com taxas de -17 a -25 km3 ano-1 durante as secas. Foram identificados seis grupos de reservatórios cujas respostas são variáveis de acordo com tipo de forçante (natural ou antropogênica) de maior controle. A análise dos tempos de resposta do sistema hidrológico sugere um tempo de até aproximadamente 6 meses para que medidas de combate às secas sejam tomadas. Este estudo ressalta as vantagens do uso combinado de dados de diferentes fontes em estudos regionais. / Natural disasters have caused major economics and human losses globally. Recent droughts over Southeast Brazil underscored the importance of having mitigation strategies to fight the effects from extreme events and a prerequisite to anticipate the impacts from future events is an understanding of past droughts by means of spatial and temporal characterization. The objective of this study is to quantify regional impacts of extreme events on the hydrological system and identify linkages between meteorological and hydrological droughts. To this end, major droughts events between 1995 and 2015 were identified and characterized. Depletion in total water storage (TWS) and main components of the water budget were analyzed. Simulated soil moisture and runoff from land surface models and remote sensing data were used, including measurements of TWS anomalies (TWSA) data from GRACE mission, rainfall and evapotranspiration estimates from TRMM and MODIS satellites, respectively. To quantify reservoir storage depletion, data from 37 reservoirs were collected. Results show that TWSA declined by 150 ± 50 km3 between 2011 and 2015 in the Paraná basin; and reservoir storage decreased 30% relative to the system\'s maximum capacity, with negative trends ranging from -17 to -25 km3 yr-1 during the droughts. Six groups of reservoirs were identified whose response vary according to the main forcing type: human and/or natural controls. Analysis of the system\'s time lag responses indicated a 6 month window during which actions could be taken to combat the drought impacts. This study emphasizes the importance of integrating remote sensing, modelling and monitoring data to evaluate droughts and develop a comprehensive understanding of the linkages between meteorological and hydrological droughts for future management. Agrupamento hierárquico GLDAS GLDAS GRACE mission Hierarchical clustering Hydrological drought Missão GRACE Remote sensing Seca hidrológica Sensoriamento remoto
42	Révéler le contenu latent du code source : à la découverte des topoi de programme / Unveiling source code latent knowledge : discovering program topoi Ieva, Carlo 23 November 2018 (has links) Le développement de projets open source à grande échelle implique de nombreux développeurs distincts qui contribuent à la création de référentiels de code volumineux. À titre d'exemple, la version de juillet 2017 du noyau Linux (version 4.12), qui représente près de 20 lignes MLOC (lignes de code), a demandé l'effort de 329 développeurs, marquant une croissance de 1 MLOC par rapport à la version précédente. Ces chiffres montrent que, lorsqu'un nouveau développeur souhaite devenir un contributeur, il fait face au problème de la compréhension d'une énorme quantité de code, organisée sous la forme d'un ensemble non classifié de fichiers et de fonctions.Organiser le code de manière plus abstraite, plus proche de l'homme, est une tentative qui a suscité l'intérêt de la communauté du génie logiciel. Malheureusement, il n’existe pas de recette miracle ou bien d’outil connu pouvant apporter une aide concrète dans la gestion de grands bases de code.Nous proposons une approche efficace à ce problème en extrayant automatiquement des topoi de programmes, c'est à dire des listes ordonnées de noms de fonctions associés à un index de mots pertinents. Comment se passe le tri? Notre approche, nommée FEAT, ne considère pas toutes les fonctions comme égales: certaines d'entre elles sont considérées comme une passerelle vers la compréhension de capacités de haut niveau observables d'un programme. Nous appelons ces fonctions spéciales points d’entrée et le critère de tri est basé sur la distance entre les fonctions du programme et les points d’entrée. Notre approche peut être résumée selon ses trois étapes principales : 1) Preprocessing. Le code source, avec ses commentaires, est analysé pour générer, pour chaque unité de code (un langage procédural ou une méthode orientée objet), un document textuel correspondant. En outre, une représentation graphique de la relation appelant-appelé (graphe d'appel) est également créée à cette étape. 2) Clustering. Les unités de code sont regroupées au moyen d’une classification par clustering hiérarchique par agglomération (HAC). 3) Sélection du point d’entrée. Dans le contexte de chaque cluster, les unités de code sont classées et celles placées à des positions plus élevées constitueront un topos de programme.La contribution de cette thèse est triple: 1) FEAT est une nouvelle approche entièrement automatisée pour l'extraction de topoi de programme, basée sur le regroupement d'unités directement à partir du code source. Pour exploiter HAC, nous proposons une distance hybride originale combinant des éléments structurels et sémantiques du code source. HAC requiert la sélection d’une partition parmi toutes celles produites tout au long du processus de regroupement. Notre approche utilise un critère hybride basé sur la graph modularity et la cohérence textuelle pour sélectionner automatiquement le paramètre approprié. 2) Des groupes d’unités de code doivent être analysés pour extraire le programme topoi. Nous définissons un ensemble d'éléments structurels obtenus à partir du code source et les utilisons pour créer une représentation alternative de clusters d'unités de code. L’analyse en composantes principales, qui permet de traiter des données multidimensionnelles, nous permet de mesurer la distance entre les unités de code et le point d’entrée idéal. Cette distance est la base du classement des unités de code présenté aux utilisateurs finaux. 3) Nous avons implémenté FEAT comme une plate-forme d’analyse logicielle polyvalente et réalisé une étude expérimentale sur une base ouverte de 600 projets logiciels. Au cours de l’évaluation, nous avons analysé FEAT sous plusieurs angles: l’étape de mise en grappe, l’efficacité de la découverte de topoi et l’évolutivité de l’approche. / During the development of long lifespan software systems, specification documents can become outdated or can even disappear due to the turnover of software developers. Implementing new software releases or checking whether some user requirements are still valid thus becomes challenging. The only reliable development artifact in this context is source code but understanding source code of large projects is a time- and effort- consuming activity. This challenging problem can be addressed by extracting high-level (observable) capabilities of software systems. By automatically mining the source code and the available source-level documentation, it becomes possible to provide a significant help to the software developer in his/her program understanding task.This thesis proposes a new method and a tool, called FEAT (FEature As Topoi), to address this problem. Our approach automatically extracts program topoi from source code analysis by using a three steps process: First, FEAT creates a model of a software system capturing both structural and semantic elements of the source code, augmented with code-level comments; Second, it creates groups of closely related functions through hierarchical agglomerative clustering; Third, within the context of every cluster, functions are ranked and selected, according to some structural properties, in order to form program topoi.The contributions of the thesis is three-fold:1) The notion of program topoi is introduced and discussed from a theoretical standpoint with respect to other notions used in program understanding ;2) At the core of the clustering method used in FEAT, we propose a new hybrid distance combining both semantic and structural elements automatically extracted from source code and comments. This distance is parametrized and the impact of the parameter is strongly assessed through a deep experimental evaluation ;3) Our tool FEAT has been assessed in collaboration with Software Heritage (SH), a large-scale ambitious initiative whose aim is to collect, preserve and, share all publicly available source code on earth. We performed a large experimental evaluation of FEAT on 600 open source projects of SH, coming from various domains and amounting to more than 25 MLOC (million lines of code).Our results show that FEAT can handle projects of size up to 4,000 functions and several hundreds of files, which opens the door for its large-scale adoption for program understanding. Apprentissage automatique Clustering hiérarchique Programme compréhension Machine Learning Hierarchical Clustering Program Comprehension Feature location Feature Extraction
43	Representações hierárquicas de vocábulos de línguas indígenas brasileiras: modelos baseados em mistura de Gaussianas / Hierarchical representations of words of brazilian indigenous languages: models based on Gaussian mixture Sepúlveda Torres, Lianet 08 December 2010 (has links) Apesar da ampla diversidade de línguas indígenas no Brasil, poucas pesquisas estudam estas línguas e suas relações. Inúmeros esforços têm sido dedicados a procurar similaridades entre as palavras das línguas indígenas e classificá-las em famílias de línguas. Seguindo a classificação mais aceita das línguas indígenas do Brasil, esta pesquisa propõe comparar palavras de 10 línguas indígenas brasileiras. Para isso, considera-se que estas palavras são sinais de fala e estima-se a função de distribuição de probabilidade (PDF) de cada palavra, usando um modelo de mistura de gaussianas (GMM). A PDF foi considerada um modelo para representar as palavras. Os modelos foram comparados utilizando medidas de distância para construir estruturas hierárquicas que evidenciaram possíveis relações entre as palavras. Seguindo esta linha, a hipótese levantada nesta pesquisa é que as PDFs baseadas em GMM conseguem caracterizar as palavras das línguas indígenas, permitindo o emprego de medidas de distância entre elas para estabelecer relações entre as palavras, de forma que tais relações confirmem algumas das classificações. Os parâmetros do GMM foram calculados utilizando o algoritmo Maximização da Expectância (em inglês, Expectation Maximization (EM)). A divergência Kullback Leibler (KL) foi empregada para medir semelhança entre as PDFs. Esta divergência serve de base para estabelecer as estruturas hierárquicas que ilustram as relações entre os modelos. A estimativa da PDF, baseada em GMM foi testada com o auxílio de sinais simulados, sendo possível confirmar que os parâmetros obtidos são próximos dos originais. Foram implementadas várias medidas de distância para avaliar se a semelhança entre os modelos estavam determinadas pelos modelos e não pelas medidas adotadas neste estudo. Os resultados de todas as medidas foram similares, somente foi observada alguma diferença nos agrupamentos realizados pela distância C2, por isso foi proposta como complemento da divergência KL. Estes resultados sugerem que as relações entre os modelos dependem das suas características, não das métricas de distância selecionadas no estudo e que as PDFs baseadas em GMM, conseguem fazer uma caracterização adequada das palavras. Em geral, foram observados agrupamentos entre palavras que pertenciam a línguas de um mesmo tronco linguístico, assim como se observou uma tendência a incluir línguas isoladas nos agrupamentos dos troncos linguísticos. Palavras que pertenciam a determinada língua apresentaram um comportamento padrão, sendo identificadas por esse tipo de comportamento. Embora os resultados para as palavras das línguas indígenas sejam inconclusivos, considera-se que o estudo foi útil para aumentar o conhecimento destas 10 línguas estudadas, propondo novas linhas de pesquisas dedicadas à análise destas palavras. / Although there exists a large diversity of indigenous languages in Brazil, there are few researches on these languages and their relationships. Numerous efforts have been dedicated to search for similarities among words of indigenous languages to classify them into families. Following the most accepted classification of Brazilian indigenous languages, this research proposes to compare words of 10 Brazilian indigenous languages. The words of the indigenous languages are considered speech signals and the Probability Distribution Function (PDF) of each word was estimated using the Gaussian Mixture Models (GMM). This estimation was considered a model to represent each word. The models were compared using distance measures to construct hierarchical structures that illustrate possible relationships among words. The hypothesis in this research is that the estimation of the PDF, based on GMM can characterize the words of indigenous languages, allowing the use of distance measures between the PDFs to establish relationships among the words and confirm some of the classifications. The Expectation Maximization algorithm (EM) was implemented to estimate the parameters that describe the GMM. The Kullback Leibler (KL) divergence was used to measure similarities between two PDFs. This divergence is the basis to establish the hierarchical structures that show the relationships among the models. The PDF estimation, based on GMM was tested using simulated signals, allowing confirming the useful approximation of the original parameters. Several distance measures were implemented to prove that the similarities among the models depended on the model of each word, and not on the distance measure adopted in this study. The results of all measures were similar, however, as the clustering results of the C2 distances showed some differences from the other clusters, C2 distance was proposed to complement the KL divergence. The results suggest that the relationships between models depend on their characteristics, and not on the distance measures selected in this study, and the PDFs based on GMM can properly characterize the words. In general, relations among languages that belong to the same linguistic branch were illustrated, showing a tendency to include isolated languages in groups of languages that belong to the same linguistic branches. As the GMM of some language families presents a standard behavior, it allows identifying each family. Although the results of the words of indigenous languages are inconclusive, this study is considered very useful to increase the knowledge of these types of languages and to propose new research lines directed to analyze this type of signals. Agrupamento hierárquico Dendogram Dendrograma Divergência KL Gaussian mixture models Hierarchical clustering Indigenous languages KL divergence Línguas indígenas Mistura de gaussianas
44	Melhorando o desempenho da técnica de clusterização hierárquica single linkage utilizando a metaheurística GRASP Ribeiro Filho, Napoleão Póvoa 30 March 2016 (has links) O problema de clusterização (agrupamento) consiste em, a partir de uma base de dados, agrupar os elementos de modo que os mais similares fiquem no mesmo cluster (grupo), e os elementos menos similares fiquem em clusters distintos. Há várias maneiras de se realizar esses agrupamentos. Uma das mais populares é a hierárquica, onde é criada uma hierarquia de relacionamentos entre os elementos. Há vários métodos de se analisar a similaridade entre elementos no problema de clusterização. O mais utilizado entre eles é o método single linkage, que agrupa os elementos que apresentarem menor distância entre si. Para se aplicar a técnica em questão, uma matriz de distâncias é a entrada utilizada. Esse processo de agrupamento gera ao final uma árvore invertida conhecida como dendrograma. O coeficiente de correlação cofenética (ccc), obtido após a construção do dendrograma, é utilizado para avaliar a consistência dos agrupamentos gerados e indica o quão fiel o dendrograma está em relação aos dados originais. Dessa forma, um dendrograma apresenta agrupamentos mais consistentes quando o ccc for o mais próximo de um (1) . O problema de clusterização em todas as suas vertentes, inclusive a clusterização hierárquica (objeto de estudo nesse trabalho), pertence a classe de problemas NP-Completo. Assim sendo, é comum o uso de heurísticas para obter soluções de modo eficiente para esse problema. Com o objetivo de gerar dendrogramas que resultem em melhores ccc, é proposto no presente trabalho um novo algoritmo que utiliza os conceitos da metaheurística GRASP. Também é objetivo deste trabalho implementar tal solução em computação paralela em um cluster computacional, permitindo assim trabalhar com matrizes de dimensões maiores. Testes foram realizados para comprovar o desempenho do algoritmo proposto, comparando os resultados obtidos com os gerados pelo software R. / The problem of clustering (grouping) consists of, from a database, group the elements so that more queries are in the same cluster (group) and less similar elements are different clusters. There are several ways to accomplish these groupings. One of the most popular is the hierarchical, where a hierarchical relationships between the elements is created. There are several methods of analyzing the similarity between elements in the clustering problem. The most common among them is the single linkage method, which brings together the elements that are experiencing less apart. To apply the technique in question, distance matrix is the input used. This grouping process generates the end an inverted tree known as dendrogram. The cophenetic correlation coefficient (ccc), obtained after the construction of the dendrogram is a measure used to evaluate the consistency of the clusters generated and indicates how faithful he is in relation to the original data. Thus, a dendrogram gives more consistent clusters when the ccc is closer to one (1). The clustering problem in all its aspects, including hierarchical clustering (object of study in this work), belongs to the class of NP-complete problems. Therefore, it is common to use heuristics for efficient solutions to this problem. In order to generate dendrograms that result in better ccc, it is proposed in this paper a new algorithm that uses the concepts of GRASP metaheuristic. It is also objective of this work to implement such a solution in parallel computing in a computer cluster, thus working with arrays larger. Tests were conducted to confirm the performance of the proposed algorithm, comparing the results with those generated by the software R. GRASP Clusterização Hierárquica Coeficiente de Correlação Cofenética Hierarchical clustering Cophenetic Correlation Coefficient
45	Métodos Bayesianos aplicados em taxonomia molecular / Bayesian methods applied in molecular taxonomy Edwin Rafael Villanueva Talavera 31 August 2007 (has links) Neste trabalho são apresentados dois métodos de agrupamento de dados visados para aplicações em taxonomia molecular. Estes métodos estão baseados em modelos probabilísticos, o que permite superar alguns problemas apresentados nos métodos não probabilísticos existentes, como a dificuldade na escolha da métrica de distância e a falta de tratamento e aproveitamento do conhecimento a priori disponível. Os métodos apresentados combinam por meio do teorema de Bayes a informação extraída dos dados com o conhecimento a priori que se dispõe, razão pela qual são denominados métodos Bayesianos. O primeiro método, método de agrupamento hierárquico Bayesiano, está baseado no algoritmo HBC (Hierarchical Bayesian Clustering). Este método constrói uma hierarquia de partições (dendrograma) baseado no critério da máxima probabilidade a posteriori de cada partição. O segundo método é baseado em um tipo de modelo gráfico probabilístico conhecido como redes Gaussianas condicionais, o qual foi adaptado para problemas de agrupamento. Ambos métodos foram avaliados em três bancos de dados donde se conhece a rótulo da classe. Os métodos foram usados também em um problema de aplicação real: a taxonomia de uma coleção brasileira de estirpes de bactérias do gênero Bradyrhizobium (conhecidas por sua capacidade de fixar o \'N IND.2\' do ar no solo). Este banco de dados é composto por dados genotípicos resultantes da análise do RNA ribossômico. Os resultados mostraram que o método hierárquico Bayesiano gera dendrogramas de boa qualidade, em alguns casos superior que o melhor dos algoritmos hierárquicos analisados. O método baseado em redes gaussianas condicionais também apresentou resultados aceitáveis, mostrando um adequado aproveitamento do conhecimento a priori sobre as classes tanto na determinação do número ótimo de grupos, quanto no melhoramento da qualidade dos agrupamentos. / In this work are presented two clustering methods thought to be applied in molecular taxonomy. These methods are based in probabilistic models which overcome some problems observed in traditional clustering methods such as the difficulty to know which distance metric must be used or the lack of treatment of available prior information. The proposed methods use the Bayes theorem to combine the information of the data with the available prior information, reason why they are called Bayesian methods. The first method implemented in this work was the hierarchical Bayesian clustering, which is an agglomerative hierarchical method that constructs a hierarchy of partitions (dendogram) guided by the criterion of maximum Bayesian posterior probability of the partition. The second method is based in a type of probabilistic graphical model knows as conditional Gaussian network, which was adapted for data clustering. Both methods were validated in 3 datasets where the labels are known. The methods were used too in a real problem: the clustering of a brazilian collection of bacterial strains belonging to the genus Bradyrhizobium, known by their capacity to transform the nitrogen (\'N IND.2\') of the atmosphere into nitrogen compounds useful for the host plants. This dataset is formed by genetic data resulting of the analysis of the ribosomal RNA. The results shown that the hierarchical Bayesian clustering method built dendrograms with good quality, in some cases, better than the other hierarchical methods. In the method based in conditional Gaussian network was observed acceptable results, showing an adequate utilization of the prior information (about the clusters) to determine the optimal number of clusters and to improve the quality of the groups. Agrupamento Agrupamento hierárquico Modelos gráficos probabilísticos Modelos probabilísticos Taxonomia molecular Clustering Hierarchical clustering Molecular taxonomy Probabilistic graphical models Probabilistic models
46	Representações hierárquicas de vocábulos de línguas indígenas brasileiras: modelos baseados em mistura de Gaussianas / Hierarchical representations of words of brazilian indigenous languages: models based on Gaussian mixture Lianet Sepúlveda Torres 08 December 2010 (has links) Apesar da ampla diversidade de línguas indígenas no Brasil, poucas pesquisas estudam estas línguas e suas relações. Inúmeros esforços têm sido dedicados a procurar similaridades entre as palavras das línguas indígenas e classificá-las em famílias de línguas. Seguindo a classificação mais aceita das línguas indígenas do Brasil, esta pesquisa propõe comparar palavras de 10 línguas indígenas brasileiras. Para isso, considera-se que estas palavras são sinais de fala e estima-se a função de distribuição de probabilidade (PDF) de cada palavra, usando um modelo de mistura de gaussianas (GMM). A PDF foi considerada um modelo para representar as palavras. Os modelos foram comparados utilizando medidas de distância para construir estruturas hierárquicas que evidenciaram possíveis relações entre as palavras. Seguindo esta linha, a hipótese levantada nesta pesquisa é que as PDFs baseadas em GMM conseguem caracterizar as palavras das línguas indígenas, permitindo o emprego de medidas de distância entre elas para estabelecer relações entre as palavras, de forma que tais relações confirmem algumas das classificações. Os parâmetros do GMM foram calculados utilizando o algoritmo Maximização da Expectância (em inglês, Expectation Maximization (EM)). A divergência Kullback Leibler (KL) foi empregada para medir semelhança entre as PDFs. Esta divergência serve de base para estabelecer as estruturas hierárquicas que ilustram as relações entre os modelos. A estimativa da PDF, baseada em GMM foi testada com o auxílio de sinais simulados, sendo possível confirmar que os parâmetros obtidos são próximos dos originais. Foram implementadas várias medidas de distância para avaliar se a semelhança entre os modelos estavam determinadas pelos modelos e não pelas medidas adotadas neste estudo. Os resultados de todas as medidas foram similares, somente foi observada alguma diferença nos agrupamentos realizados pela distância C2, por isso foi proposta como complemento da divergência KL. Estes resultados sugerem que as relações entre os modelos dependem das suas características, não das métricas de distância selecionadas no estudo e que as PDFs baseadas em GMM, conseguem fazer uma caracterização adequada das palavras. Em geral, foram observados agrupamentos entre palavras que pertenciam a línguas de um mesmo tronco linguístico, assim como se observou uma tendência a incluir línguas isoladas nos agrupamentos dos troncos linguísticos. Palavras que pertenciam a determinada língua apresentaram um comportamento padrão, sendo identificadas por esse tipo de comportamento. Embora os resultados para as palavras das línguas indígenas sejam inconclusivos, considera-se que o estudo foi útil para aumentar o conhecimento destas 10 línguas estudadas, propondo novas linhas de pesquisas dedicadas à análise destas palavras. / Although there exists a large diversity of indigenous languages in Brazil, there are few researches on these languages and their relationships. Numerous efforts have been dedicated to search for similarities among words of indigenous languages to classify them into families. Following the most accepted classification of Brazilian indigenous languages, this research proposes to compare words of 10 Brazilian indigenous languages. The words of the indigenous languages are considered speech signals and the Probability Distribution Function (PDF) of each word was estimated using the Gaussian Mixture Models (GMM). This estimation was considered a model to represent each word. The models were compared using distance measures to construct hierarchical structures that illustrate possible relationships among words. The hypothesis in this research is that the estimation of the PDF, based on GMM can characterize the words of indigenous languages, allowing the use of distance measures between the PDFs to establish relationships among the words and confirm some of the classifications. The Expectation Maximization algorithm (EM) was implemented to estimate the parameters that describe the GMM. The Kullback Leibler (KL) divergence was used to measure similarities between two PDFs. This divergence is the basis to establish the hierarchical structures that show the relationships among the models. The PDF estimation, based on GMM was tested using simulated signals, allowing confirming the useful approximation of the original parameters. Several distance measures were implemented to prove that the similarities among the models depended on the model of each word, and not on the distance measure adopted in this study. The results of all measures were similar, however, as the clustering results of the C2 distances showed some differences from the other clusters, C2 distance was proposed to complement the KL divergence. The results suggest that the relationships between models depend on their characteristics, and not on the distance measures selected in this study, and the PDFs based on GMM can properly characterize the words. In general, relations among languages that belong to the same linguistic branch were illustrated, showing a tendency to include isolated languages in groups of languages that belong to the same linguistic branches. As the GMM of some language families presents a standard behavior, it allows identifying each family. Although the results of the words of indigenous languages are inconclusive, this study is considered very useful to increase the knowledge of these types of languages and to propose new research lines directed to analyze this type of signals. Agrupamento hierárquico Dendrograma Divergência KL Línguas indígenas Mistura de gaussianas Dendogram Gaussian mixture models Hierarchical clustering Indigenous languages KL divergence
47	Métodos Bayesianos aplicados em taxonomia molecular / Bayesian methods applied in molecular taxonomy Villanueva Talavera, Edwin Rafael 31 August 2007 (has links) Neste trabalho são apresentados dois métodos de agrupamento de dados visados para aplicações em taxonomia molecular. Estes métodos estão baseados em modelos probabilísticos, o que permite superar alguns problemas apresentados nos métodos não probabilísticos existentes, como a dificuldade na escolha da métrica de distância e a falta de tratamento e aproveitamento do conhecimento a priori disponível. Os métodos apresentados combinam por meio do teorema de Bayes a informação extraída dos dados com o conhecimento a priori que se dispõe, razão pela qual são denominados métodos Bayesianos. O primeiro método, método de agrupamento hierárquico Bayesiano, está baseado no algoritmo HBC (Hierarchical Bayesian Clustering). Este método constrói uma hierarquia de partições (dendrograma) baseado no critério da máxima probabilidade a posteriori de cada partição. O segundo método é baseado em um tipo de modelo gráfico probabilístico conhecido como redes Gaussianas condicionais, o qual foi adaptado para problemas de agrupamento. Ambos métodos foram avaliados em três bancos de dados donde se conhece a rótulo da classe. Os métodos foram usados também em um problema de aplicação real: a taxonomia de uma coleção brasileira de estirpes de bactérias do gênero Bradyrhizobium (conhecidas por sua capacidade de fixar o \'N IND.2\' do ar no solo). Este banco de dados é composto por dados genotípicos resultantes da análise do RNA ribossômico. Os resultados mostraram que o método hierárquico Bayesiano gera dendrogramas de boa qualidade, em alguns casos superior que o melhor dos algoritmos hierárquicos analisados. O método baseado em redes gaussianas condicionais também apresentou resultados aceitáveis, mostrando um adequado aproveitamento do conhecimento a priori sobre as classes tanto na determinação do número ótimo de grupos, quanto no melhoramento da qualidade dos agrupamentos. / In this work are presented two clustering methods thought to be applied in molecular taxonomy. These methods are based in probabilistic models which overcome some problems observed in traditional clustering methods such as the difficulty to know which distance metric must be used or the lack of treatment of available prior information. The proposed methods use the Bayes theorem to combine the information of the data with the available prior information, reason why they are called Bayesian methods. The first method implemented in this work was the hierarchical Bayesian clustering, which is an agglomerative hierarchical method that constructs a hierarchy of partitions (dendogram) guided by the criterion of maximum Bayesian posterior probability of the partition. The second method is based in a type of probabilistic graphical model knows as conditional Gaussian network, which was adapted for data clustering. Both methods were validated in 3 datasets where the labels are known. The methods were used too in a real problem: the clustering of a brazilian collection of bacterial strains belonging to the genus Bradyrhizobium, known by their capacity to transform the nitrogen (\'N IND.2\') of the atmosphere into nitrogen compounds useful for the host plants. This dataset is formed by genetic data resulting of the analysis of the ribosomal RNA. The results shown that the hierarchical Bayesian clustering method built dendrograms with good quality, in some cases, better than the other hierarchical methods. In the method based in conditional Gaussian network was observed acceptable results, showing an adequate utilization of the prior information (about the clusters) to determine the optimal number of clusters and to improve the quality of the groups. Agrupamento Agrupamento hierárquico Clustering Hierarchical clustering Modelos gráficos probabilísticos Modelos probabilísticos Molecular taxonomy Probabilistic graphical models Probabilistic models Taxonomia molecular
48	A Clustering-based Approach to Document-Category Integration Cheng, Tsang-Hsiang 04 September 2003 (has links) E-commerce applications generate and consume tremendous amount of online information that is typically available as textual documents. Observations of textual document management practices by organizations or individuals suggest the popularity of using categories (or category hierarchies) to organize, archive and access documents. On the other hand, an organization (or individual) also constantly acquires new documents from various Internet sources. Consequently, integration of relevant categorized documents into existent categories of the organization (or individual) becomes an important issue in the e-commerce era. Existing categorization-based approach for document-category integration (specifically, the Enhanced Naïve Bayes classifier) incurs several limitations, including homogeneous assumption on categorization schemes used by master and source catalogs and requirement for a large-sized master categories as training data. In this study, we developed a Clustering-based Category Integration (CCI) technique to deal with integrating two document catalogs each of which is organized non-hierarchically (i.e., in a flat set). Using the Enhanced Naïve Bayes classifier as benchmarks, the empirical evaluation results showed that the proposed CCI technique appeared to improve the effectiveness of document-category integration accuracy in different integration scenarios and seemed to be less sensitive to the size of master categories than the categorization-based approach. Furthermore, to integrate the document categories that are organized hierarchically, we proposed a Clustering-based category-Hierarchy Integration (referred to as CHI) technique extended the CCI technique and for category-hierarchy integration. The empirical evaluation results showed that the CHI technique appeared to improve the effectiveness of hierarchical document-category integration than that attained by CCI under homogeneous and comparable scenarios. Catalog Integration Document Category Integration Naïve Bayes Classifier Hierarchical Category Integration Hierarchical Clustering Document Clustering
49	Use of Machine Learning Algorithms to Propose a New Methodology to Conduct, Critique and Validate Urban Scale Building Energy Modeling January 2017 (has links) abstract: City administrators and real-estate developers have been setting up rather aggressive energy efficiency targets. This, in turn, has led the building science research groups across the globe to focus on urban scale building performance studies and level of abstraction associated with the simulations of the same. The increasing maturity of the stakeholders towards energy efficiency and creating comfortable working environment has led researchers to develop methodologies and tools for addressing the policy driven interventions whether it’s urban level energy systems, buildings’ operational optimization or retrofit guidelines. Typically, these large-scale simulations are carried out by grouping buildings based on their design similarities i.e. standardization of the buildings. Such an approach does not necessarily lead to potential working inputs which can make decision-making effective. To address this, a novel approach is proposed in the present study. The principle objective of this study is to propose, to define and evaluate the methodology to utilize machine learning algorithms in defining representative building archetypes for the Stock-level Building Energy Modeling (SBEM) which are based on operational parameter database. The study uses “Phoenix- climate” based CBECS-2012 survey microdata for analysis and validation. Using the database, parameter correlations are studied to understand the relation between input parameters and the energy performance. Contrary to precedence, the study establishes that the energy performance is better explained by the non-linear models. The non-linear behavior is explained by advanced learning algorithms. Based on these algorithms, the buildings at study are grouped into meaningful clusters. The cluster “mediod” (statistically the centroid, meaning building that can be represented as the centroid of the cluster) are established statistically to identify the level of abstraction that is acceptable for the whole building energy simulations and post that the retrofit decision-making. Further, the methodology is validated by conducting Monte-Carlo simulations on 13 key input simulation parameters. The sensitivity analysis of these 13 parameters is utilized to identify the optimum retrofits. From the sample analysis, the envelope parameters are found to be more sensitive towards the EUI of the building and thus retrofit packages should also be directed to maximize the energy usage reduction. / Dissertation/Thesis / Masters Thesis Architecture 2017 Architectural engineering Energy Climate change Building Energy Simulation Data visualisation Hierarchical clustering Non-linear models Sensitivity Analysis Urban Building Energy Modeling
50	Propagação de secas na bacia do Rio Paraná: do evento climático ao impacto hidrológico / Drougth propagation in the Paraná river basin: from the climatic event to the hydrologic impact Davi de Carvalho Diniz Melo 26 April 2017 (has links) Desastres naturais (secas, enchentes, etc) têm resultado em perdas humanas e grandes prejuízos financeiros em diversos lugares do mundo. Os recentes períodos de seca ocorridos na região sudeste do Brasil mostraram a importância de se dispor de estratégias de mitigação dos efeitos decorrentes desses eventos extremos. Um pré-requisito para prever impactos desses eventos no futuro, é compreender como os mesmos ocorreram no passado, caracterizando-os espacial e temporalmente. Diante do exposto, o objetivo deste trabalho é quantificar os impactos regionais no sistema hidrológico causados por eventos extremos e identificar conexões entre as secas meteorológicas e hidrológicas, usando a bacia do rio Paraná como estudo de caso. Para tanto, foram identificados e caracterizados os principais eventos de seca ocorridos entre 1995 e 2015, analisaram-se as perdas de água nos componentes do balanço hídrico e no armazenamento total de água. Foram utilizados dados de sensoriamento remoto, incluindo medições da missão GRACE de anomalias no armazenamento total de água terrestre (TWSA), e estimativas de precipitação e evapotranspiração pelos satélites TRMM e MODIS, respectivamente. Simulações de modelos globais de assimilação de dados de superfície terrestre forneceram estimativas de escoamento superficial e umidade do solo. Foram coletados dados de 37 reservatórios para quantificar as perdas de água no armazenamento em terra. Os resultados mostram que o TWSA diminuiu 150 ± 50 km3 entre 2011 e 2015 na bacia do rio Paraná, o armazenamento dos reservatórios diminuiu 30% em relação à capacidade máxima do sistema com taxas de -17 a -25 km3 ano-1 durante as secas. Foram identificados seis grupos de reservatórios cujas respostas são variáveis de acordo com tipo de forçante (natural ou antropogênica) de maior controle. A análise dos tempos de resposta do sistema hidrológico sugere um tempo de até aproximadamente 6 meses para que medidas de combate às secas sejam tomadas. Este estudo ressalta as vantagens do uso combinado de dados de diferentes fontes em estudos regionais. / Natural disasters have caused major economics and human losses globally. Recent droughts over Southeast Brazil underscored the importance of having mitigation strategies to fight the effects from extreme events and a prerequisite to anticipate the impacts from future events is an understanding of past droughts by means of spatial and temporal characterization. The objective of this study is to quantify regional impacts of extreme events on the hydrological system and identify linkages between meteorological and hydrological droughts. To this end, major droughts events between 1995 and 2015 were identified and characterized. Depletion in total water storage (TWS) and main components of the water budget were analyzed. Simulated soil moisture and runoff from land surface models and remote sensing data were used, including measurements of TWS anomalies (TWSA) data from GRACE mission, rainfall and evapotranspiration estimates from TRMM and MODIS satellites, respectively. To quantify reservoir storage depletion, data from 37 reservoirs were collected. Results show that TWSA declined by 150 ± 50 km3 between 2011 and 2015 in the Paraná basin; and reservoir storage decreased 30% relative to the system\'s maximum capacity, with negative trends ranging from -17 to -25 km3 yr-1 during the droughts. Six groups of reservoirs were identified whose response vary according to the main forcing type: human and/or natural controls. Analysis of the system\'s time lag responses indicated a 6 month window during which actions could be taken to combat the drought impacts. This study emphasizes the importance of integrating remote sensing, modelling and monitoring data to evaluate droughts and develop a comprehensive understanding of the linkages between meteorological and hydrological droughts for future management. Agrupamento hierárquico GLDAS Missão GRACE Seca hidrológica Sensoriamento remoto GLDAS GRACE mission Hierarchical clustering Hydrological drought Remote sensing

Search results