321 |
Species Distribution Modeling: Implications of Modeling Approaches, Biotic Effects, Sample Size, and Detection Limit
Wang, Lifei 14 January 2014
When we develop and use species distribution models to predict species' current or potential distributions, we are faced with the trade-offs between model generality, precision, and realism. It is important to know how to improve and validate model generality while maintaining good model precision and realism. However, it is difficult for ecologists to evaluate species distribution models using field-sampled data alone because the true species response function to environmental or ecological factors is unknown. Species distribution models should be able to approximate the true characteristics and distributions of species if ecologists want to use them as reliable tools. Simulated data provide the advantage of being able to know the true species-environment relationships and control the causal factors of interest to obtain insights into the effects of these factors on model performance. I used a case study on Bythotrephes longimanus distributions from several hundred Ontario lakes and a simulation study to explore the effects on model performance caused by several factors: the choice of predictor variables, the model evaluation methods, the quantity and quality of the data used for developing models, and the strengths and weaknesses of different species distribution models. Linear discriminant analysis, multiple logistic regression, random forests, and artificial neural networks were compared in both studies. Results based on field data sampled from lakes indicated that the predictive performance of the four models was more variable when developed on abiotic (physical and chemical) conditions alone, whereas the generality of these models improved when including biotic (relevant species) information. 
When using simulated data, although the overall performance of random forests and artificial neural networks was better than that of linear discriminant analysis and multiple logistic regression, the latter two models had relatively good and stable sensitivity across sample size and detection limit levels, which may be useful for predicting species presences when data are limited. Random forests performed consistently well across sample size levels but were more sensitive to a high detection limit. The performance of artificial neural networks was affected by both sample size and detection limit, and they were more sensitive to small sample size.
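To make the two quantities discussed above concrete, here is a minimal Python sketch of how model sensitivity might be computed and how a detection limit could be imposed on simulated abundances. The function names, the toy abundances and the rule "abundance below the limit is recorded as an absence" are illustrative assumptions, not details taken from the thesis.

```python
def apply_detection_limit(abundances, limit):
    """Turn true abundances into recorded presence (1) / absence (0).

    Assumption: a species is recorded as present only when its abundance
    reaches the detection limit; lower abundances go undetected.
    """
    return [1 if a >= limit else 0 for a in abundances]


def sensitivity_specificity(observed, predicted):
    """Return (sensitivity, specificity) for binary presence/absence data."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # presences predicted present
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # presences missed
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # absences predicted absent
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical abundances at six sites and hypothetical model predictions.
true_abundance = [0, 0.5, 2, 5, 0, 8]
recorded = apply_detection_limit(true_abundance, limit=1)
model_prediction = [0, 1, 1, 0, 0, 1]
print(sensitivity_specificity(recorded, model_prediction))
```

Raising `limit` converts more low-abundance sites into recorded absences, which is one simple way a detection limit can degrade the data a model is trained and evaluated on.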
|
322 |
The file fragment classification problem: a combined neural network and linear programming discriminant model approach
Wilgenbus, Erich Feodor January 2013
The increased use of digital media to store legal, as well as illegal data, has created the need
for specialized tools that can monitor, control and even recover this data. An important task
in computer forensics and security is to identify the true file type to which a computer file
or computer file fragment belongs. File type identification is traditionally done by means
of metadata, such as file extensions and file header and footer signatures. As a result,
traditional metadata-based file object type identification techniques work well in cases where
the required metadata is available and unaltered. However, traditional approaches are not
reliable when the integrity of metadata is not guaranteed or metadata is unavailable. As
an alternative, any pattern in the content of a file object can be used to determine the
associated file type. This is called content-based file object type identification.
Supervised learning techniques can be used to infer a file object type classifier by exploiting
some unique pattern that underlies a file type's common file structure. This study builds
on existing literature regarding the use of supervised learning techniques for content-based
file object type identification, and explores the combined use of multilayer perceptron neural
network classifiers and linear programming-based discriminant classifiers as a solution to the
multiple class file fragment type identification problem.
The purpose of this study was to investigate and compare the use of a single multilayer
perceptron neural network classifier, a single linear programming-based discriminant
classifier and a combined ensemble of these classifiers in the field of file type identification. The
ability of each individual classifier and the ensemble of these classifiers to accurately predict
the file type to which a file fragment belongs was tested empirically.
The study found that both a multilayer perceptron neural network and a linear
programming-based discriminant classifier (used in a round robin) seemed to perform well in solving
the multiple class file fragment type identification problem. The results of combining
multilayer perceptron neural network classifiers and linear programming-based discriminant
classifiers in an ensemble were not better than those of the single optimized classifiers. / MSc (Computer Science), North-West University, Potchefstroom Campus, 2013
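As an illustration of what content-based file fragment identification can look like, the following Python sketch builds a normalized byte-frequency histogram for a fragment and assigns it to the nearest class centroid. Both the byte-histogram feature and the nearest-centroid rule are assumptions chosen for brevity; the abstract does not state which features were used, and the study's classifiers were a multilayer perceptron and a linear programming-based discriminant, not this rule.

```python
import math
from collections import Counter


def byte_histogram(fragment: bytes) -> list:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(fragment)
    total = len(fragment) or 1  # avoid division by zero on empty fragments
    return [counts.get(value, 0) / total for value in range(256)]


def centroid(histograms: list) -> list:
    """Element-wise mean of several histograms (one class prototype)."""
    n = len(histograms)
    return [sum(h[i] for h in histograms) / n for i in range(256)]


def classify(fragment: bytes, centroids: dict) -> str:
    """Label of the centroid closest (Euclidean distance) to the fragment."""
    h = byte_histogram(fragment)
    return min(centroids, key=lambda label: math.dist(h, centroids[label]))


# Hypothetical training fragments for two made-up classes.
training = {
    "text": [b"the quick brown fox", b"jumps over the lazy dog"],
    "binary": [bytes(range(256)), bytes(range(128, 256)) * 2],
}
centroids = {
    label: centroid([byte_histogram(f) for f in fragments])
    for label, fragments in training.items()
}
print(classify(b"a lazy brown dog", centroids))
```

The point of the sketch is only that no header, footer or extension is consulted: the decision rests entirely on the bytes inside the fragment.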
|
324 |
Generalized N-body problems: a framework for scalable computation
Riegel, Ryan Nelson 13 January 2014
In the wake of the Big Data phenomenon, the computing world has seen a number of computational paradigms developed in response to the sudden need to process ever-increasing volumes of data. Most notably, MapReduce has proven quite successful in scaling out an extensible class of simple algorithms to even hundreds of thousands of nodes. However, there are some tasks, even embarrassingly parallelizable ones, that neither MapReduce nor any existing automated parallelization framework is well-equipped to perform. For instance, any computation that (naively) requires consideration of all pairs of inputs becomes prohibitively expensive even when parallelized over a large number of worker nodes.
Many of the most desirable methods in machine learning and statistics exhibit these kinds of all-pairs or, more generally, all-tuples computations; accordingly, their application in the Big Data setting may seem beyond hope. However, a new algorithmic strategy inspired by breakthroughs in computational physics has shown great promise for a wide class of computations dubbed generalized N-body problems (GNBPs). This strategy, which involves the simultaneous traversal of multiple space-partitioning trees, has been applied to a succession of well-known learning methods, accelerating each asymptotically and by orders of magnitude. Examples of these include all-k-nearest-neighbors search, k-nearest-neighbors classification, k-means clustering, EM for mixtures of Gaussians, kernel density estimation, kernel discriminant analysis, kernel machines, particle filters, the n-point correlation, and many others. For each of these problems, no overall faster algorithms are known. Further, these dual- and multi-tree algorithms compute either exact results or approximations to within specified error bounds, a rarity amongst fast methods.
This dissertation aims to unify a family of GNBPs under a common framework in order to ease implementation and future study. We start by formalizing the problem class and then describe a general algorithm, the generalized fast multipole method (GFMM), capable of solving all problems that fit the class, though with varying degrees of speedup. We then show O(N) and O(log N) theoretical run-time bounds that may be obtained under certain conditions. As a corollary, we derive the tightest known general-dimensional run-time bounds for exact all-nearest-neighbors and several approximated kernel summations.
Next, we implement a number of these algorithms in a commercial database, empirically demonstrating dramatic asymptotic speedup over their conventional SQL implementations. Lastly, we implement a fast, parallelized algorithm for kernel discriminant analysis and apply it to a large dataset (40 million points in 4D) from the Sloan Digital Sky Survey, identifying approximately one million quasars with high accuracy. This exceeds the previous largest catalog of quasars in size by a factor of ten and has since been used in a follow-up study to confirm the existence of dark energy.
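The contrast between a naive all-pairs computation and a tree-based one can be sketched in a few lines of Python. The example below solves all-nearest-neighbors twice: once by brute force over every pair, and once by querying a kd-tree for each point. This is a deliberately simplified single-tree search, not the dual-tree traversal or the GFMM described in the dissertation, and all function names are illustrative.

```python
import math
import random


def brute_force_all_nn(points):
    """O(N^2) baseline: index of the nearest other point, for every point."""
    result = []
    for i, p in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, q in enumerate(points):
            if i != j:
                d = math.dist(p, q)
                if d < best_d:
                    best_j, best_d = j, d
        result.append(best_j)
    return result


def build_kdtree(indices, points, depth=0):
    """Recursively split point indices at the median along cycling axes."""
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(indices[:mid], points, depth + 1),
        "right": build_kdtree(indices[mid + 1:], points, depth + 1),
    }


def nearest(node, points, query_idx, best=(-1, math.inf)):
    """Nearest neighbor of points[query_idx], excluding the point itself."""
    if node is None:
        return best
    i, axis = node["index"], node["axis"]
    q = points[query_idx]
    if i != query_idx:
        d = math.dist(q, points[i])
        if d < best[1]:
            best = (i, d)
    diff = q[axis] - points[i][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query_idx, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[1]:
        best = nearest(far, points, query_idx, best)
    return best


def all_nn_kdtree(points):
    """All-nearest-neighbors via one kd-tree query per point."""
    tree = build_kdtree(list(range(len(points))), points)
    return [nearest(tree, points, i)[0] for i in range(len(points))]


random.seed(0)
demo_points = [(random.random(), random.random()) for _ in range(50)]
print(all_nn_kdtree(demo_points)[:5])
```

The pruning test in `nearest` is what lets the tree version skip most pairs while still returning the exact answer; the multi-tree algorithms in the dissertation apply the same kind of bound to whole groups of queries at once.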
|
325 |
Learning algorithms for sparse classification
Sanchez Merchante, Luis Francisco 07 June 2013
This thesis deals with the development of estimation algorithms with embedded feature selection in the context of high-dimensional data, in the supervised and unsupervised frameworks. The contributions of this work are two algorithms: GLOSS for the supervised domain and Mix-GLOSS for its unsupervised counterpart. Both algorithms are based on solving an optimal scoring regression regularized with a quadratic formulation of the group-Lasso penalty, which encourages the removal of uninformative features. The theoretical foundations proving that a group-Lasso penalized optimal scoring regression can be used to solve a linear discriminant analysis are developed for the first time in this work. The theory that adapts this technique to the unsupervised domain by means of the EM algorithm is not new, but it has never been clearly set out for a sparsity-inducing penalty. This thesis demonstrates that using group-Lasso penalized optimal scoring regression inside an EM algorithm is possible. Our algorithms have been tested on real and artificial high-dimensional databases, with impressive results in terms of parsimony and without compromising prediction performance.
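To show why a group-Lasso penalty removes whole features, the Python sketch below evaluates the standard penalty on a small coefficient matrix. It assumes that each row holds one feature's coefficients across the discriminant directions and that each row forms one group; it uses the plain sum-of-norms form, not the quadratic formulation developed in the thesis, and the function names are illustrative.

```python
import math


def row_norm(row):
    """Euclidean norm of one feature's coefficient vector."""
    return math.sqrt(sum(b * b for b in row))


def group_lasso_penalty(B, lam=1.0):
    """lam times the sum of row norms of the coefficient matrix B."""
    return lam * sum(row_norm(row) for row in B)


def active_features(B, tol=1e-12):
    """Indices of features whose whole coefficient row is non-zero."""
    return [j for j, row in enumerate(B) if row_norm(row) > tol]


# Hypothetical coefficients: 3 features, 2 discriminant directions.
B = [
    [3.0, 4.0],  # feature 0 is used by both directions
    [0.0, 0.0],  # feature 1 has been removed entirely
    [1.0, 0.0],  # feature 2 is used by one direction
]
print(group_lasso_penalty(B), active_features(B))
```

Because the penalty acts on the norm of a row rather than on individual entries, shrinking it to zero discards the feature from every discriminant direction at once, which is the parsimony the abstract refers to.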
|
326 |
Evaluation of the impacts of municipal wastewater treatment on the receiving environment: a case study of the Olifantsvlei wastewater treatment plant in the Gauteng Province, South Africa
Mothetha, Matome Lucky 03 1900
South Africa is a water-scarce country, with maximum rainfall received in the summer season, which lasts for only three months (November, December and January); hence its water resources have to be protected. Municipal wastewater effluents are considered one of the environmental threats to the water quality of streams. This study was conducted to assess the environmental impact that wastewater effluent has on the Klip River system, the performance of the plant, and the spatial and temporal variations of water quality along the Klip River system. The study focused mainly on five years (2009 – 2013) of historical secondary data analysed by Johannesburg Water Ltd (Pty); primary data were also collected and analysed using standard methods of laboratory analysis. The standard methods used include ion-selective electrode, gravimetric techniques, iodometric titration, the membrane filtration method, the colorimetric method, the automated flow injection method and inductively coupled plasma atomic emission spectrometry (ICP – AES). The aim of collecting the primary data during the dry and wet seasons was to verify the secondary data. The data set was further analysed using multivariate techniques such as principal component analysis (PCA), factor analysis (FA) and discriminant analysis (DA) to determine the spatial and temporal variation of water quality. The data set of ten water quality parameters (ammonia, sulphates, chlorine, chemical oxygen demand, conductivity, Escherichia coli, sodium, nitrates, pH and suspended solids) was grouped into four sampling points (influent, effluent, downstream and upstream points) and four seasons. Discriminant analysis of water quality showed that, of the ten water quality parameters analysed, only sulphates was a less significant parameter for discriminating between the sampling points.
For the temporal variations, eight water quality parameters (ammonium, chlorine, conductivity, sodium, nitrates, pH, sulphates and suspended solids) were the most significant for discriminating between the four seasons. PCA/FA results highlighted similarities in water quality loading between the summer and winter seasons and between the winter and autumn seasons. The summer and winter seasons had strong positive loadings on COD, ammonium, suspended solids and E. coli, whereas the autumn and spring seasons had strong positive loadings on sodium, chlorine and pH. The study further highlighted that the Olifantsvlei Wastewater Treatment Works (WWTW) treats wastewater effectively to the required standards before discharging it into the Klip River system. The study concludes that the Olifantsvlei WWTW does not contribute significant loads of pollutants to the Klip River system. / Environmental Sciences / M. Sc. (Environmental Science)
|
327 |
Statistical modelling by neural networks
Fletcher, Lizelle 30 June 2002
In this thesis the two disciplines of Statistics and Artificial Neural Networks
are combined into an integrated study of a data set of a weather modification
experiment.
An extensive literature study on artificial neural network methodology has
revealed the strongly interdisciplinary nature of the research and the applications
in this field.
As artificial neural networks are becoming increasingly popular with data
analysts, statisticians are becoming more involved in the field. A recursive
algorithm is developed to optimize the number of hidden nodes in a feedforward
artificial neural network to demonstrate how existing statistical techniques
such as nonlinear regression and the likelihood-ratio test can be applied in
innovative ways to develop and refine neural network methodology.
This pruning algorithm is an original contribution to the field of artificial
neural network methodology that simplifies the process of architecture selection,
thereby reducing the number of training sessions that is needed to find
a model that fits the data adequately.
In addition, a statistical model to classify weather modification data is developed
using both a feedforward multilayer perceptron artificial neural network
and a discriminant analysis. The two models are compared and the effectiveness
of applying an artificial neural network model to a relatively small
data set assessed.
The formulation of the problem, the approach that has been followed to
solve it and the novel modelling application all combine to make an original
contribution to the interdisciplinary fields of Statistics and Artificial Neural
Networks as well as to the discipline of meteorology. / Mathematical Sciences / D. Phil. (Statistics)
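The idea of using a likelihood-ratio test to decide whether an extra hidden node is worth keeping can be sketched as follows. The Python below assumes Gaussian errors, so that the statistic reduces to n·ln(SSE_small / SSE_large) for two nested networks fitted to the same n observations, and it treats the chi-square reference distribution as an approximation. The function names, the toy error sums and the small table of 5% critical values are illustrative assumptions; this is not the recursive algorithm developed in the thesis.

```python
import math

# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def lr_statistic(sse_small, sse_large, n):
    """Likelihood-ratio statistic for nested least-squares fits.

    sse_small: residual sum of squares of the network with fewer hidden nodes.
    sse_large: residual sum of squares of the network with the extra node.
    n: number of observations used for both fits.
    """
    return n * math.log(sse_small / sse_large)


def extra_node_justified(sse_small, sse_large, n, extra_params):
    """True if the larger network fits significantly better at the 5% level."""
    return lr_statistic(sse_small, sse_large, n) > CHI2_CRIT_05[extra_params]


# Hypothetical case: one input, so an extra hidden node adds 3 parameters
# (input weight, bias, output weight).
print(extra_node_justified(sse_small=120.0, sse_large=100.0, n=50, extra_params=3))
print(extra_node_justified(sse_small=105.0, sse_large=100.0, n=50, extra_params=3))
```

Applied repeatedly, a rule of this kind stops adding (or starts removing) hidden nodes once the improvement in fit is no longer statistically distinguishable from noise, which is how it can reduce the number of training sessions needed for architecture selection.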
|
328 |
Human locomotion analysis, classification and modeling of normal and pathological vertical ground reaction force signals in elderly / Analyse, classification et modélisation de la locomotion humaine : application a des signaux GRF sur une population âgée
Alkhatib, Rami 12 July 2016
La marche est définie par des séquences de gestes cycliques et répétées. Il a été déjà montré que la vitesse et la variabilité de ces séquences peuvent révéler des aptitudes ou des défaillances motrices. L’originalité de ce travail est alors d’analyser et de caractériser les foulées de sujets âgés à partir des signaux de pression issus de semelles instrumentées lors de la marche, au moyen d’outils de traitement du signal. Une étude préliminaire, sur les signaux de pression générés lors de la marche, nous a permis de mettre en évidence le caractère cyclo-stationnaire de ces signaux. Ces paramètres sont testées sur une population de 47 sujets. Tout d'abord, nous avons commencé par un prétraitement des signaux et nous avons montré dans la première de cette thèse que le filtrage peut éliminer une partie vitale du signal. C’est pourquoi un filtre adaptatif basé sur la décomposition en mode empirique a été conçu. Les points de retournement ont été filtrés ensuite en utilisant une technique temps-fréquence appelée «synochronosqueezing». Nous avons également montré que le contenu des signaux de force de marche est fortement affecté par des paramètres inquantifiables tels que les tâches cognitives qui les rendent difficiles à normaliser. C’est pourquoi les paramètres extraits de nos signaux sont tous dérivées par une comparaison inter-sujet. Par exemple, nous avons assimilé la différence dans la répartition de poids entre les pieds. Il est également recommandé dans ce travail de choisir le centre des capteurs plutôt que de compter sur la somme des forces issues du réseau de capteurs pour la classification. Ensuite, on a montré que l’hypothèse de la marche équilibrée et déséquilibrée peut améliorer les résultats de la classification. Le potentiel de cette hypothèse est montré à l'aide de la répartition du poids ainsi que le produit de l'âge × vitesse dans le premier classificateur et la corrélation dans le second classificateur. 
Une simulation de la série temporelle de VGRF basé sur une version modifiée du modèle de Markov non stationnaire, du premier ordre est ensuite dérivée. Ce modèle prédit les allures chez les sujets normaux et suffisamment pour les allures des sujets de Parkinson. On a trouvé que les trois modes: temps, fréquence et espace sont très utiles pour l’analyse des signaux de force, c’est pourquoi l’analyse de facteurs parallèles est introduite comme étant une méthode de tenseur qui peut être utilisée dans le futur
/ Walking is defined as sequences of repetitive cyclic gestures. It has already been shown that the speed and variability of these sequences can reveal abilities or motor-skill failures. The originality of this work is to analyze and characterize the steps of elderly persons using pressure signals. In a preliminary study, we showed that pressure signals are characterized by cyclostationarity. In this study, we intend to exploit the nonstationarity of the signals in a search for new indicators that can help classify gait signals of normal and Parkinson subjects in the elderly population. These parameters are tested on a population of 47 subjects. First, we preprocessed the vertical ground reaction force (VGRF) signals and showed in the first part of the thesis that filtering can remove a vital part of the signal. That is why an adaptive filter based on empirical mode decomposition (EMD) was built. Turning points are then filtered using synchrosqueezing of time-frequency representations of the signal. We also showed that the content of gait force signals is highly affected by unquantifiable parameters such as cognitive tasks, which makes them hard to normalize. That is why the extracted features are derived from inter-subject comparison. For example, we equated the difference in the load distribution between feet.
It is also recommended in this work to choose the mid-sensor rather than relying on the summation of forces from the array of sensors for classification purposes. A hypothesis of balanced and unbalanced gait is shown to have potential for improving classification accuracy. The power of this hypothesis is shown by using the load distribution and Age×Speed in the first classifier and the correlation in the second classifier. A time series simulation of VGRF based on a modified version of a first-order nonstationary Markov model is then derived. This model successfully predicts gait in normal subjects and does fairly well for Parkinson’s gait. We found that the three modes (time, frequency and space) are helpful in analyzing force signals, which is why parallel factor analysis is introduced as a tensor method to be used in future work.
|
329 |
FITOSSOCIOLOGIA DE COMUNIDADES ARBÓREAS EM SAVANAS DO BRASIL CENTRAL / PHYTOSOCIOLOGY OF THE ARBOREAL COMMUNITIES IN SAVANNAS FROM CENTRAL BRAZIL
Finger, Zenesio 11 February 2008
These studies were undertaken in the state of Mato Grosso, Brazil, in the region of Chapada dos Guimarães and Baixada Cuiabana, which comprise a high plateau and a large low plain, respectively, and were restricted to two areas covered by vegetation with a savanna physiognomy of the Cerrado stricto sensu type. Starting from the hypothesis that knowledge of both the biotic and abiotic components of the landscape and of their interrelations allows a better understanding of environmental dynamics, this dissertation had the following objectives: to characterize the arboreal stratum of the savanna communities floristically and phytosociologically with respect to richness, phytosociological structure and diversity; to identify floristic groupings through multivariate statistical techniques, representing them by dendrograms; to select species with real power to discriminate among the groups; to obtain discriminant functions that allow sampling units to be classified and reclassified into the groups to which they most probably belong; to analyze and characterize the groups obtained; to determine the distribution patterns of tree species by analyzing the correlations of environmental variables with the distribution of species and plots in the communities studied; to determine the similarity indices among the floristic groups and compare them; and, finally, to test methods of multivariate statistical analysis for application in studies of plant communities. Vegetation data were obtained by the multiple-plot method, with plots of 20 × 20 m (400 m²) randomly placed in each of the study areas. In total, 82 plots were installed at random. In each of the 82 sampling units, the circumference of every arboreal plant with a perimeter at 0.30 m above soil level (PAB) greater than or equal to 15.7 cm (DAB ≥ 5.0 cm) and the total height of the plants were recorded.
At the center of each plot, simple samples of surface soil (0-30 cm depth) were collected to determine the chemical and textural variables of the soil. Species were organized according to the families recognized by the Angiosperm Phylogeny Group II. Sampling sufficiency was assessed by analysis of the collector's curve. Phytosociological parameters were calculated for each group formed in order to characterize it phytosociologically. Using the Cover Value Index (IVC) of the species as variables, the plots were classified into floristic groups by the TWINSPAN (Two-Way Indicator Species Analysis) method. Diversity was determined by the Shannon-Wiener and Simpson indices. The discriminant analysis was carried out by the STEPWISE method. From the matrix of presence and absence of species in the groups, the floristic similarity among the groups was calculated by the Sorensen Index. To evaluate the hypothesis of a correlation between the distribution of the species and environmental variables, a canonical correspondence analysis (CCA) was performed. The Monte Carlo permutation test was applied to verify the significance of the correlations between the emerging distribution patterns of the species and the environmental variables in the final CCA. To determine the environmental factors responsible for the distribution of the species, logistic regression analysis was used. The Forward Stepwise (Wald) method was used for the sequential selection of the variables. From the species-area curve, it can be observed that, from plot 75 onward (30,000 m² of the sampled area), the curve stabilizes at 114 species in the 82 plots studied, distributed among 81 genera and 36 botanical families. The best-represented families were Fabaceae, Myrtaceae and Vochysiaceae.
The alpha diversity of the arboreal vegetation in the study area was 4.033 by the Shannon-Wiener Index and 0.975 by the Simpson Index, indicating high floristic diversity. The divisions generated by the TWINSPAN classification separated the plots into four groups: Group 1, Myrcia albo-tomentosa Camb. Association; Group 2, Pterodon emarginatus Vog. Association; Group 3, Curatella americana L. Association; and Group 4, Qualea multiflora Mart. Association. In the discriminant analysis, 100% of the plots were classified correctly into Groups 1, 2, 3 and 4, indicating the precision of the grouping technique used. The greatest similarity was observed between Groups 2 and 3, whose Sorensen Index was close to 1 (0.7310). In the four floristic groups, the families Fabaceae, Myrtaceae, Vochysiaceae, Annonaceae and Apocynaceae were the most representative floristically in number of genera and species. In the CCA, the correlations of the environmental variables with the first ordination axis were, in decreasing order of absolute value, aluminum saturation, altitude above sea level, base saturation, magnesium saturation, magnesium/potassium ratio, hydrogen saturation, potassium content, pH(H2O) and calcium/potassium ratio. The calcium saturation variable showed a very weak correlation with the first axis but a very strong one with the second ordination axis. In the ordination diagram of the plots, the four floristic groups were separated into different sectors of the diagram, reinforcing their interpretation as well-defined habitats with a particular species composition and resulting in a clear separation of the four soil classes previously identified. The logistic regression analysis confirmed the results obtained from the CCA concerning the environmental variables that determined the distribution of the indicator species of the floristic groups in the communities studied.
/ Estes estudos foram desenvolvidos no estado de Mato Grosso, Brasil, na região de Chapada dos Guimarães e Baixada Cuiabana, que compreendem, respectivamente, um alto platô e uma grande planície baixa, restringindo-se a duas áreas cobertas por vegetação com fisionomia savânica do tipo Cerrado stricto sensu. Partindo-se da hipótese de que o conhecimento tanto dos componentes bióticos e abióticos da paisagem como de suas inter-relações permite um melhor entendimento da dinâmica ambiental, o presente estudo teve como objetivos caracterizar o estrato arbóreo das comunidades de savana estudadas, florística e fitossociologicamente, quanto a riqueza, estrutura fitossociológica e diversidade; identificar agrupamentos florísticos, por meio de técnicas estatísticas multivariadas, representando-os por meio de dendrograma; selecionar espécies com poder real de discriminação entre os grupos; obter funções discriminantes que permitam classificar e reclassificar unidades amostrais, nos grupos, para os quais têm maior probabilidade de pertencerem; analisar e caracterizar os grupos obtidos; determinar os padrões de distribuição das espécies de árvores, por meio da análise de correlações de variáveis ambientais com a distribuição das espécies e parcelas nas comunidades estudadas; determinar os índices de similaridade entre os grupos florísticos obtidos e compará-los; e testar métodos de análise estatística multivariada para aplicação em estudos de comunidades vegetais. Os dados da vegetação foram obtidos empregando-se o método de parcelas múltiplas, com tamanho de 20 X 20 m (400 m2), dispostas aleatoriamente em cada uma das áreas de estudos. Foram instaladas aleatoriamente 82 parcelas. Em cada uma das 82 unidades amostrais, foram obtidas as circunferências de todos as plantas arbóreas com perímetro a 0,30 m do nível do solo (PAB) maior ou igual a 15,7 cm (DAB 5,0 cm), e a altura total das plantas. 
No centro de cada parcela, para determinação das variáveis químicas e texturais do solo, coletaram-se amostras simples de solo superficial (0-30 cm de profundidade). As espécies foram organizadas de acordo com as famílias reconhecidas pelo Angiosperm Phylogeny Group II. A suficiência de amostragem foi obtida com base na análise da curva do coletor. Os parâmetros fitossociológicos foram calculados para cada grupo formado, com a finalidade de caracterizá-los fitossociológicamente. Tendo como variáveis o Índice de Valor de Cobertura (IVC) das espécies, foi realizada a classificação, por meio do método TWINSPAN (Two-Way Indicator Species Analisys), com relação às parcelas, com o objetivo de classificá-las em grupos florísticos. A diversidade foi determinada por meio do Índice de Shannon-Wienner e de Simpson. Realizou-se a análise discriminante por meio do método STEPWISE. A partir da matriz de presença e ausência das espécies nos grupos, foi calculada a similaridade florística entre os grupos, por meio do Índice de Sorensen. Para avaliar a hipótese da existência de correlação entre a distribuição das espécies e variáveis ambientais, foi realizada a análise de correspondência canônica (CCA). Foi aplicado o teste de permutação de Monte Carlo para verificar a significância das correlações entre os padrões de distribuição emergentes das espécies e as variáveis ambientais na CCA final. Para determinar os fatores ambientais responsáveis pela distribuição das espécies, foi utilizada a análise de regressão logística. À seleção seqüencial das variáveis foi utilizado o método Forward Stepwise (Wald). Pela curva espécie-área, pode-se observar que, a partir da parcela 75 (30.000 m2 da área amostrada), a curva estabiliza-se com a ocorrência de 114 espécies nas 82 parcelas estudadas, distribuídas entre 81 gêneros e 36 famílias botânicas. As famílias mais bem representadas foram Fabaceae, Myrtaceae e Vochysiaceae. 
The alpha diversity of the tree vegetation in the study area was 4.033 by the Shannon-Wiener index and 0.975 by the Simpson index, indicating high floristic diversity. The divisions generated by the TWINSPAN classification separated the plots into four groups: Group 1 - Myrcia albo-tomentosa Camb. association; Group 2 - Pterodon emarginatus Vog. association; Group 3 - Curatella americana L. association; and Group 4 - Qualea multiflora Mart. association. In the discriminant analysis, 100% of the plots were correctly classified into groups 1, 2, 3, and 4, indicating the accuracy of the grouping technique used. The greatest similarity was found between groups 2 and 3, whose Sørensen index was close to 1 (0.7310). In the four floristic groups obtained, the families Fabaceae, Myrtaceae, Vochysiaceae, Annonaceae, and Apocynaceae were the most floristically representative in number of genera and species. In the CCA, the correlations of the environmental variables with the first ordination axis were, in decreasing order of absolute value: aluminum saturation, altitude above sea level, base saturation, magnesium saturation, magnesium/potassium ratio, hydrogen saturation, potassium content, pH(H2O), and calcium/potassium ratio. The calcium saturation variable showed a very weak correlation with the first axis but a very strong one with the second ordination axis. In the ordination diagram of the plots, the four floristic groups were separated into different sectors of the diagram, reinforcing their interpretation as well-defined habitats with a particular species composition and resulting in a clear separation of the four previously identified soil classes. The logistic regression analysis confirmed the CCA results regarding the environmental variables that determined the distribution of the indicator species of the floristic groups in the communities studied.
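The diversity and similarity indices named in this abstract can be illustrated with a minimal Python sketch. It assumes the natural-log form of the Shannon-Wiener index and the Gini-Simpson form (1 minus the sum of squared proportions) of the Simpson index; the abstract does not state which variants the thesis used, and the function names are hypothetical.

```python
import math

def shannon_index(counts):
    """Shannon-Wiener H' = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Gini-Simpson form, 1 - sum(p_i^2); the thesis may use another variant."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def sorensen_similarity(species_a, species_b):
    """Sorensen index 2c / (a + b) from two presence lists (c = shared species)."""
    a, b = set(species_a), set(species_b)
    if not a and not b:
        raise ValueError("both species lists are empty")
    return 2 * len(a & b) / (len(a) + len(b))

# Illustrative data only, not taken from the thesis.
abundances = [10, 10, 10, 10]  # four equally abundant species
print(shannon_index(abundances))   # equals ln(4) for four equal abundances
print(simpson_index(abundances))
print(sorensen_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

With presence-absence data per floristic group, `sorensen_similarity` would be applied to each pair of groups to build the similarity comparison the abstract describes.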
|
330 |
Novas estratégias para seleção de variáveis por intervalos em problemas de classificação [New strategies for interval-based variable selection in classification problems]
Fernandes, David Douglas de Sousa 26 August 2016
Previous issue date: 2016-08-26 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / In Analytical Chemistry, the literature increasingly reports the use of analytical signals recorded on multiple sensors, combined with subsequent chemometric modeling, to develop new analytical methodologies. Multivariate instrumental techniques such as ultraviolet-visible or near-infrared spectrometry and voltammetry are generally used for this purpose. In this scenario, the analyst can select either individual variables or intervals of variables in order to avoid or reduce multicollinearity problems. A well-known strategy for interval selection is to divide the set of instrumental responses into equal-width intervals and select the best one based on the prediction performance of a single interval in Partial Least Squares regression (iPLS). By contrast, interval selection for classification purposes has received relatively little attention. A common practice is to use iPLS regression with the coded class indices as the response variables to be predicted, which is the basic idea behind the Partial Least Squares Discriminant Analysis (PLS-DA) version for classification. In other words, no native algorithms have been developed for interval selection in classification. This work therefore proposes two new strategies for classification problems that select intervals of variables using the Successive Projections Algorithm. The first is named Successive Projections Algorithm for interval selection in Partial Least Squares Discriminant Analysis (iSPA-PLS-DA), and the second is named Successive Projections Algorithm for interval selection in Soft Independent Modeling of Class Analogy (iSPA-SIMCA).
The performance of the proposed algorithms was evaluated in three case studies: classification of vegetable oils according to raw material type and expiration date, using data obtained by square wave voltammetry; classification of unadulterated biodiesel/diesel blends (B5) and blends adulterated with soybean oil (OB5), using spectral data obtained in the ultraviolet-visible region; and classification of vegetable oils with respect to expiration date, using spectral data obtained in the near-infrared region. The proposed iSPA-PLS-DA and iSPA-SIMCA algorithms provided good results in all three case studies, with correct classification rates always greater than or equal to those obtained by PLS-DA and SIMCA models using all variables, by iPLS-DA and iSIMCA with a single selected interval, and by SPA-LDA and GA-LDA with selection of individual variables. The proposed iSPA-PLS-DA and iSPA-SIMCA algorithms can therefore be considered promising approaches for classification problems that employ interval selection. More generally, the possibility of using interval selection without loss of classification accuracy can be considered a very useful tool for building dedicated instruments (e.g., LED-based photometers) for routine and in situ analysis.
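The single-interval baseline that the abstract describes (split the variables into equal-width intervals, then keep the interval that classifies best) can be sketched in plain Python. This is not iSPA-PLS-DA or iSPA-SIMCA: a nearest-centroid classifier with leave-one-out validation stands in for PLS-DA purely for illustration, and every function name here is hypothetical.

```python
def split_into_intervals(n_variables, n_intervals):
    """Return (start, stop) index pairs of near-equal width covering all variables."""
    base, extra = divmod(n_variables, n_intervals)
    bounds, start = [], 0
    for i in range(n_intervals):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (assumption): assign x to the closest class centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(
        sorted(centroids),
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )

def loo_accuracy(X, y):
    """Leave-one-out correct classification rate."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

def best_interval(X, y, n_intervals):
    """Score each interval separately and return ((start, stop), accuracy)."""
    scores = []
    for start, stop in split_into_intervals(len(X[0]), n_intervals):
        subset = [row[start:stop] for row in X]
        scores.append(((start, stop), loo_accuracy(subset, y)))
    return max(scores, key=lambda item: item[1])

# Toy signals: only the last two variables separate class A from class B.
X = [[1.0, 2.0, 0.1, 0.2], [2.0, 1.0, 0.2, 0.1],
     [1.0, 2.0, 5.1, 5.2], [2.0, 1.0, 5.2, 5.1]]
y = ["A", "A", "B", "B"]
print(best_interval(X, y, n_intervals=2))
```

According to the abstract, the proposed methods instead select intervals with the Successive Projections Algorithm; that selection step is not reproduced in this sketch.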
|