Global ETD Search

1	Some problems in high dimensional data analysis Pham, Tung Huy January 2010 The boom in economics and technology has had an enormous impact on society. Along with these developments, human activities now produce massive amounts of data that can be collected easily and at relatively low cost with the aid of new technologies. Examples include web term-document data, sensor arrays, gene expression, finance data, imaging and hyperspectral analysis. Because of the enormous amount of data from new and varied sources, more and more challenging scientific problems appear, and these have changed the types of problems on which mathematical scientists work. / In traditional statistics, the dimension of the data, p say, is low, and the number of observations, n say, is large. In this case, classical results such as the Central Limit Theorem are often applied to gain some understanding from the data. A new challenge for statisticians today is a different setting, in which the data dimension is very large and the number of observations is small. The mathematical assumption may now be p > n, or even p tending to infinity with n fixed; for example, there may be few patients but many genes. In these cases, classical methods fail to produce a good understanding of the nature of the problem. Hence, new methods need to be found to solve these problems, and mathematical explanations are needed to generalize these cases. / The research presented in this thesis covers two problems, variable selection and classification, in the case where the dimension is very large. The work on variable selection, in particular on the Adaptive Lasso, was completed by June 2007, and the research on classification was carried out throughout 2008 and 2009. The research on the Dantzig selector and the Lasso was finished in July 2009. Therefore, this thesis is divided into two parts. 
In the first part of the thesis we study the Adaptive Lasso, the Lasso and the Dantzig selector. In particular, in Chapter 2 we present some results for the Adaptive Lasso. Chapter 3 provides two examples showing that neither the Dantzig selector nor the Lasso is definitively better than the other. The second part of the thesis is organized as follows. In Chapter 5, we construct the model setting. In Chapter 6, we summarize the results on the scaled centroid-based classifier and prove some further results about it. Because there are similarities between the Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD) classifiers, Chapter 8 introduces a class of distance-based classifiers that can be considered a generalization of both. Chapters 9 and 10 are about the SVM and DWD classifiers. Chapter 11 demonstrates the performance of these classifiers on simulated data sets and some cancer data sets.
3	Determinação de hidrocarbonetos majoritarios presentes no gas natural utilizando espectroscopia no infravermelho proximo e calibração multivariada / Determination of major hydrocarbons in natural gas using near infrared spectroscopy and chemometrics Franco, Camila Manara 10 March 2008 Advisor: Jarbas Jose Rodrigues Rohwedder / Master's dissertation, Universidade Estadual de Campinas, Instituto de Quimica / Abstract: Near Infrared (NIR) Spectroscopy and Chemometrics were used to construct calibration models to determine the concentration of major hydrocarbons in gas mixtures at concentrations similar to those observed in natural gas. The spectra were obtained with two different NIR spectrophotometers built in the laboratory, one employing a cell with a fixed optical path and the other a cell with a variable optical path. Different sample sets were prepared to mimic the variability of methane, ethane, propane and butane concentrations found in natural gas from various sources. The analysis of certified samples using the calibration models gave Root-Mean-Square Errors of Prediction (RMSEP) of 1.07, 0.21, 0.22 and 0.14% (v/v) for methane, ethane, propane and butane, respectively. The prediction of methane content showed better repeatability with NIR spectroscopy than with the standard technique, gas chromatography. To investigate the possibility of constructing an NIR spectrometer dedicated to the analysis of natural gas, a variable selection study was carried out. The results indicated that, using up to 13% of the initial number of variables (280), the hydrocarbon gases can be predicted with no loss of analytical quality compared to the analysis using the full spectral range. 
Employing the selected wavelengths, it is possible to predict the concentration of methane, ethane, propane and butane with RMSEP values of 1.32, 0.41, 0.22 and 0.14% (v/v), respectively. / Master's / Analytical Chemistry / Master in Chemistry NIR Spectroscopy Natural gas Variables selection PLS
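The RMSEP figures quoted in this abstract follow the usual definition: the square root of the mean squared difference between reference and predicted values. A minimal sketch of that calculation is given here; the function name `rmsep` and the sample values are hypothetical and are not taken from the dissertation.

```python
import math


def rmsep(reference, predicted):
    """Root-mean-square error of prediction over paired values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical methane contents in % (v/v); illustrative only.
certified = [85.0, 88.5, 90.2, 92.0]
nir_predicted = [86.0, 87.5, 91.2, 91.0]
print(round(rmsep(certified, nir_predicted), 2))
```

Because the error is expressed in the same unit as the measured quantity, the values for the full-spectrum and selected-wavelength models can be compared directly.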
4	Bio-Inspired Artificial Intelligence Approach for Reinforced Concrete Block Shear Wall System Response Predictions Elgamel, Hana January 2022 Reinforced concrete block shear walls (RCBSWs) are used as seismic force resisting systems in low- and medium-rise buildings. However, owing to their nonlinear behavior and composite material nature, accurately predicting their seismic performance from mechanics alone is challenging. This study introduces multi-gene genetic programming (MGGP), a class of bio-inspired artificial intelligence, to uncover the complexity of RCBSW behaviors and develop simplified procedures for predicting the full backbone curve of flexure-dominated fully grouted RCBSWs under cyclic loading. A piecewise linear backbone curve was developed using five secant stiffness expressions associated with cracking, yielding, 80% ultimate, ultimate, and 20% strength degradation (i.e., post-peak stage), derived through controlled MGGP. Based on the experimental results of large-scale cyclically loaded RCBSWs, compiled from previously reported studies, a variable selection procedure was performed to identify the most influential variable subset governing wall behaviors. Using individual wall results, the MGGP stiffness expressions were first trained and tested, and their accuracy was then compared to that of existing models using various statistical measures. In addition, the predictive ability of the developed backbone model was assessed at the system level against experimental results of two two-story buildings available in the literature. The outcomes of this study demonstrate the power of the MGGP approach in addressing the complexity of the cyclic behavior of RCBSWs at both component and system level, offering an efficient prediction tool that can be adopted by relevant seismic design standards pertaining to RCBSW buildings. 
/ Thesis / Master of Applied Science (MASc) backbone model fully grouted reinforced masonry walls seismic performance variables selection multigene genetic programming
5	Méthodes bayésiennes semi-paramétriques d'extraction et de sélection de variables dans le cadre de la dendroclimatologie / Semi-parametric Bayesian Methods for variables extraction and selection in a dendroclimatological context Guin, Ophélie 14 April 2011 As stated by the Intergovernmental Panel on Climate Change (IPCC), it is important to know the past climate in order to place current climate change in context. A large number of researchers have worked to develop procedures to reconstruct past temperatures or precipitation from indirect climatic indicators. These procedures are generally based on statistical methods, but estimating the uncertainties associated with these reconstructions remains a major difficulty. The main goal of this thesis is to propose and study novel statistical methods that allow a precise estimation of uncertainties, in particular for reconstructions from tree-ring measurement data. Generally, climatic reconstructions from tree-ring observations involve two steps. First, a hidden variable, common to a collection of tree-ring measurement series and assumed to be climatic, has to be adequately inferred. Second, this extracted signal has to be related to the relevant climatic variables. For both steps, we have opted to work within a semi-parametric Bayesian framework that reduces the number of assumptions and allows prior information from the practitioner to be included. Concerning the extraction of the common signal, we propose a semi-parametric hierarchical model that can capture both the high and the low frequencies contained in tree rings, which was difficult in previous dendroclimatological studies. 
For the second step, we have developed a Bayesian Generalized Additive Model (GAM) to explore potential links between the extracted signal and some climatic variables. This allows non-linear relationships among variables to be modeled, in contrast to classical dendrochronological methods. Each new method is compared with the methods traditionally used by dendrochronologists in order to understand what it can offer them. From a statistical perspective, a new selection scheme for the Bayesian GAM was also proposed and studied. Bayesian estimation Hierarchical models Spline Variables selection Dendrochronology Climatic reconstructions
6	Determinação de parâmetros (sólidos solúveis, pH e acidez titulável) em ameixas intactas usando espectroscopia no infravermelho próximo e seleção de comprimento de onda / Determination of parameters (soluble solids, pH and titratable acidity) in intact plums using near-infrared spectroscopy and wavelength selection Costa, Rosangela Câmara 17 May 2013 The aim of this study was to evaluate the potential of near-infrared reflectance spectroscopy (NIRS) as a rapid and non-destructive method to determine the soluble solids content (SSC), pH and titratable acidity of intact plums. Plum samples with a soluble solids content ranging from 5.7 to 15%, pH from 2.72 to 3.84 and titratable acidity from 0.88 to 3.6% were collected from supermarkets in Natal, Brazil, and NIR spectra were acquired in the 714-2500 nm range. Several multivariate calibration techniques were compared with respect to data pre-processing and variable selection algorithms, such as interval Partial Least Squares (iPLS), genetic algorithm (GA), successive projections algorithm (SPA) and ordered predictors selection (OPS). Validation models for SSC, pH and titratable acidity had correlation coefficients (R) of 0.95, 0.90 and 0.80, and root mean square errors of prediction (RMSEP) of 0.45 °Brix, 0.07 and 0.40%, respectively. From these results, it can be concluded that NIR spectroscopy can be used as a non-destructive alternative for measuring the SSC, pH and titratable acidity of plums.
7	Metodologias analíticas para a identificação de não conformidades em amostras de álcool combustível / Analytical methodologies for identifying nonconformities in fuel alcohol samples Silva, Adenilton Camilo da 27 August 2013 Conselho Nacional de Pesquisa e Desenvolvimento Científico e Tecnológico - CNPq / In Brazil, one of the forms in which ethanol fuel is marketed is the hydrated form (HEAF, Hydrated Ethyl Alcohol Fuel). The adulterations found in HEAF samples are a concern because they can cause tax losses and harm to society. From this perspective, this work proposes new analytical methods based on infrared spectroscopy (near, NIR, and mid, MIR) and Cyclic Voltammetry (with a copper electrode), together with chemometric pattern recognition techniques, to identify HEAF adulterated with water or methanol. A total of 184 HEAF samples collected from different fuel stations were analyzed. These samples were divided into three classes: (1) unadulterated, (2) adulterated with water (0.5% to 10% m/m), and (3) adulterated with methanol (2% to 13% m/m). Principal Components Analysis (PCA) was applied to the data and showed a tendency for the unadulterated and adulterated samples to form clusters. Classification models were based on Linear Discriminant Analysis (LDA) with prior variable selection by the following algorithms: SPA (Successive Projections Algorithm), GA (Genetic Algorithm), and SW (Stepwise). PLS-DA (Partial Least Squares Discriminant Analysis) was also applied to the data. For the MIR spectra, 100% correct classification was achieved with all models. 
For the NIR data, SPA-LDA and SW-LDA achieved correct classification rates (RCC) of 84.4% and 97.8%, respectively, while PLS-DA and GA-LDA correctly classified all test samples. For the voltammetric data, both SPA-LDA and PLS-DA achieved an RCC of 93%, but the GA-LDA and SW-LDA models performed better, correctly classifying all test samples. The results suggest that the proposed methods are promising alternatives for identifying, quickly and reliably, HEAF samples adulterated with water or methanol. 
EXACT AND EARTH SCIENCES::CHEMISTRY Infrared spectroscopy Voltammetry Multivariate classification Variables selection
8	Avaliação de porfirinas na determinação simultânea de cátions por espectofotometria uv-vis e calibração multivariada / Assessment of porphyrins in the simultaneous determination of cations by uv-vis spectrophotometry and multivariate calibration Nino, Ivson de Carvalho 30 September 2014 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / In this work, we investigated the use of porphyrins as non-selective complexing agents for the simultaneous determination of the cations Pb²⁺, Zn²⁺, Cu²⁺, Mn²⁺, Co²⁺ and Hg²⁺ employing first-order multivariate calibration. At first, given its widespread availability and lower cost, 5,10,15,20-tetraphenylporphyrin (H2TPP) was used in organic medium to complex cations in aqueous medium. Instability of this system prevented the completion of this proposal, but some important aspects of the system are presented. As an alternative, we evaluated the use of 5,10,15,20-tetrakis(4-carboxyphenyl)porphyrin (H2TCPP) in aqueous medium. A 2^(4-1) fractional factorial design indicated that the best metallation conditions were pH 9, a heating time of 10 minutes and a heating temperature of 80 °C. The best concentration of catalyst (Cd²⁺) was 5 × 10⁻⁸ mol L⁻¹. A calibration set was constructed employing a Brereton design for the six cations at five concentration levels. External validation was performed with a set of ten samples containing random concentrations of the analytes. Calibration models were constructed based on partial least-squares regression (PLS) and on multiple linear regression (MLR) combined with variable selection by genetic algorithm (GA) or the successive projections algorithm (SPA). 
The method was employed in the analysis of mineral water samples, and good apparent recoveries were obtained when spiked samples were predicted by the SPA-MLR models. Simultaneous determination Metals Mesoporphyrins Multivariate calibration Variables selection EXACT AND EARTH SCIENCES::CHEMISTRY
9	Forêts aléatoires et sélection de variables : analyse des données des enregistreurs de vol pour la sécurité aérienne / Random forests and variable selection : analysis of the flight data recorders for aviation safety Gregorutti, Baptiste 11 March 2015 New regulations require airlines to establish a risk management strategy to reduce the number of accidents even further. 
Flight recorder data, little exploited to date, must be systematically analysed in order to identify, measure and monitor the evolution of risks. The aim of this thesis is to propose a set of methodological tools to address the problem of flight data analysis. Our work revolves around two statistical topics: variable selection in supervised learning and functional data analysis. We use the random forests algorithm because it includes importance measures that can be embedded in variable selection procedures. First, we study the permutation importance measure when the variables are correlated. This criterion is then extended to groups of variables, and a new selection procedure for functional variables is introduced. These methods are applied to the risks of long landing and hard landing, two important questions for airlines. Finally, we present the integration of the proposed methods into the FlightScanner product developed by Safety Line. This new solution for air transport allows both the monitoring of risks and the tracking of the factors that influence them. Random forests Variables selection Permutation importance measure Correlation Functional data analysis Aviation safety 519.5
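The permutation importance measure mentioned in this abstract can be illustrated in its generic form: shuffle one variable, leaving the others untouched, and record how much the prediction error increases. The sketch below is not the thesis's procedure. It uses a fixed prediction function as a stand-in for a fitted random forest and does not reproduce the extensions to correlated or grouped variables; all names (`mse`, `permutation_importance`, `toy_model`) are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of a prediction function on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the target depends on variable 0 only; variable 1 is irrelevant.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(30)]
y = [3 * row[0] for row in X]


def toy_model(row):
    # Stand-in for a fitted predictor (an assumption, not a random forest).
    return 3 * row[0]


scores = permutation_importance(toy_model, X, y)
print(scores)
```

A variable the model never uses gets an importance of zero, while shuffling an informative variable raises the error, which is what makes the measure usable as a ranking criterion in a selection procedure.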
10	Aplicações de técnicas multivariadas na área comercial de uma empresa de comunicação / Applications of multivariate techniques in the commercial area of a communication company Moraes, Renan Manhabosco January 2017 
The change in consumer behavior brought about by technology and social networks greatly empowers consumers, substantially altering the way companies relate to their final audience. Attentive to this market, media companies are undergoing profound changes, both in how they deliver content to their audience and in their administrative, strategic and financial structure. This dissertation therefore presents approaches based on multivariate techniques for composing commercial teams and for remunerating the sales teams of a communication company. In article 1, the objective is to build a model to estimate the commercial awards of the sales teams of the RBS Group radio stations. To do this, the RBS Group radio stations in the states of Rio Grande do Sul and Santa Catarina are first grouped according to their similarity profiles. For each cluster, a multiple linear regression of the commercial award is fitted and validated by cross-validation using the adjusted R² and the Mean Absolute Percentage Error (MAPE). The second article addresses the clustering of the RBS Group's top clients and its impact on the composition of the commercial teams, using variable selection. The 7 original variables were evaluated with the "omit one variable at a time" selection method; the best average Silhouette Index (SI), the metric used to evaluate the quality of the generated clusters, was obtained when 3 variables were retained. The clusters generated by these variables reflect the customers' media buying behavior, and they were considered satisfactory when evaluated by RBS Group experts. Cluster Communication company Sales Clustering Top Customers Allocation of Sales Teams Variables Selection Radio Sales Awards from the RBS Group Predictive Model Linear Regression
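The "omit one variable at a time" selection scored by the average Silhouette Index can be sketched as a backward elimination: at each step, drop the variable whose removal gives the highest SI, and remember the best subset seen. The abstract does not name the clustering algorithm, so the k-means routine below is an assumption, as are all function names and the toy data.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points, labels):
    """Average silhouette: (b - a) / max(a, b) per point, 0 for edge cases."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def omit_one_at_a_time(data, k, min_vars=1):
    """Backward elimination of variables guided by the mean silhouette."""
    def score(cols):
        pts = [[row[c] for c in cols] for row in data]
        return mean_silhouette(pts, kmeans(pts, k))

    kept = list(range(len(data[0])))
    best_cols, best_si = kept[:], score(kept)
    while len(kept) > min_vars:
        # Try omitting each remaining variable; keep the most helpful omission.
        si, drop = max((score([c for c in kept if c != d]), d) for d in kept)
        kept.remove(drop)
        if si > best_si:
            best_cols, best_si = kept[:], si
    return best_cols, best_si


# Toy data: variables 0 and 1 separate two groups; variable 2 is noise.
data = [
    [0.0, 0.1, 5.0], [0.2, 0.0, 1.0], [0.1, 0.2, 9.0], [0.0, 0.0, 3.0],
    [5.0, 5.1, 2.0], [5.2, 5.0, 8.0], [5.1, 5.2, 4.0], [5.0, 5.0, 6.0],
]
selected, si = omit_one_at_a_time(data, k=2)
print(selected, round(si, 3))
```

Returning the best subset seen, instead of the last one, mirrors the abstract's statement that the best average SI occurred at an intermediate number of retained variables.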
