71

Quantification vectorielle en grande dimension : vitesses de convergence et sélection de variables / High-dimensional vector quantization : convergence rates and variable selection

Levrard, Clément 30 September 2014
The distortion of the quantizer built from an n-sample of a probability distribution over a vector space with the well-known k-means algorithm is studied first in this thesis. More precisely, the aim is to give oracle inequalities on the difference between the distortion of the k-means quantizer and the minimum distortion achievable by a k-point quantizer, describing precisely the influence of the natural parameters of the quantization problem: the support of the distribution, the size k of the quantizer's set of images, the dimension of the underlying Euclidean space, and the sample size n. After a brief summary of previous work on this topic, an equivalence is established, in the continuous-density case, between the conditions previously proposed for the excess distortion to decrease fast with the sample size and a technical condition resembling the conditions required in supervised statistical learning to achieve fast convergence rates. It is then proved that, under this technical condition, the excess distortion achieves a fast convergence rate of 1/n in expectation. Next, an easier-to-interpret margin condition is introduced and shown to imply the technical condition above. Several classical examples of distributions satisfying the margin condition are given, such as Gaussian mixtures, which are standard in the clustering framework. Provided the margin condition holds, an oracle inequality on the excess distortion of the k-means quantizer follows: the excess distortion decreases at rate 1/n and depends on natural geometric quantities associated with the distribution through the number of images k; surprisingly, the dimension of the underlying Euclidean space seems to play no role in the convergence rate. Following this last point, the results extend directly to the case where the underlying space is a Hilbert space, the appropriate framework for curve quantization. In practice, however, high-dimensional quantization often requires a preliminary dimension-reduction step, which motivates the second part of this thesis: a variable selection procedure adapted to quantization. More precisely, a Lasso-type procedure adapted to the vector quantization framework is studied, in which the Lasso penalty applies to the set of image points of the quantizer so as to obtain sparse image points. Under the margin condition introduced above, several theoretical guarantees are established for the resulting quantizer, called the Lasso k-means quantizer: its image points are close to those of a naturally sparse quantizer achieving a trade-off between excess distortion and the size of the support of the image points, and its excess distortion is of order 1/n^(1/2) in the sample size. The dependence of this convergence rate on the other parameters of the problem is described explicitly. These theoretical predictions are illustrated with numerical experiments, which broadly confirm the expected properties of such a sparse quantizer but also highlight some drawbacks of the practical implementation of the procedure.
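To make the Lasso k-means idea concrete, here is a minimal sketch of a Lloyd-style iteration in which the centroid update is soft-thresholded, so the L1 penalty on the image points zeroes out uninformative coordinates. The function name, penalty placement and threshold constant are illustrative assumptions, not the exact procedure analyzed in the thesis.

```python
import numpy as np

def lasso_kmeans(X, k, lam, n_iter=50, seed=0):
    """Toy Lasso k-means sketch: Lloyd iterations whose centroid update
    is soft-thresholded, driving image points toward sparse coordinates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            m = pts.mean(axis=0)
            # Update step: soft-threshold the cluster mean coordinate-wise;
            # lam / len(pts) plays the role of the per-cluster L1 penalty.
            centers[j] = np.sign(m) * np.maximum(np.abs(m) - lam / len(pts), 0.0)
    return centers, labels

# Toy data: only the first 2 of 10 coordinates carry cluster structure.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(50, 10)) for c in (-1.0, 1.0)])
X[:, 2:] = rng.normal(0, 0.2, size=(100, 8))
centers, _ = lasso_kmeans(X, k=2, lam=5.0)
print(np.round(centers, 2))  # informative coordinates survive, the rest shrink to 0
```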
72

Développement de représentations et d'algorithmes efficaces pour l'apprentissage statistique sur des données génomiques / Learning from genomic data : efficient representations and algorithms.

Le Morvan, Marine 03 July 2018
Since the first sequencing of the human genome in the early 2000s, large endeavours have set out to map genetic variability among individuals and DNA alterations in cancer cells. They have laid the foundations for the emergence of precision medicine, which aims at integrating the genetic specificities of an individual with their conventional medical record in order to adapt treatment and prevention strategies. Translating DNA variations and alterations into phenotypic predictions is, however, a difficult problem. DNA sequencers and microarrays measure more variables than there are samples, which poses statistical issues. The data are also subject to technical biases and to the noise inherent in these technologies. Finally, the vast and intricate networks of interactions among proteins obscure the impact of DNA variations on cell behaviour, prompting the need for predictive models able to capture a certain degree of complexity. This thesis presents novel methodological contributions to address these challenges. First, we define a novel representation of tumour mutation profiles that exploits prior knowledge of protein-protein interaction networks. For certain cancers, this representation improves survival predictions from mutation data and stratifies patients into meaningful subgroups. Second, we present a new learning framework that jointly handles data normalisation and the estimation of a linear model. Our experiments show that it improves predictive performance compared with handling these tasks sequentially. Finally, we propose a new algorithm to scale up the estimation of sparse linear models with two-way interactions. The resulting speed-up makes this estimation possible and efficient for datasets with hundreds of thousands of main effects, thereby extending the scope of such models to data from genome-wide association studies.
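As a point of reference for the interaction-model contribution, the sketch below fits an L1-penalized linear model over an explicitly materialized set of two-way interactions using scikit-learn. The data and penalty value are invented for illustration; the thesis's algorithm specifically avoids building this feature expansion, which is infeasible at genome-wide scale.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)   # toy binary genotype matrix
y = 2.0 * X[:, 0] * X[:, 1] - X[:, 2] + rng.normal(0, 0.1, 200)

# Naive baseline: materialize all pairwise products (30 main effects
# + 435 interactions = 465 columns), then fit an L1-penalized model.
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)
model = Lasso(alpha=0.05).fit(X_int, y)
print((model.coef_ != 0).sum(), "of", X_int.shape[1], "features selected")
```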
73

LASSO與其衍生方法之特性比較 / Property comparison of LASSO and its derivative methods

黃昭勳, Huang, Jau-Shiun Unknown Date
In this study, we compare several methods for estimating the coefficients of linear models, including the LASSO, Elastic Net, LAD-LASSO, EBLASSO and EBENet. Unlike ordinary least squares (OLS), these methods perform coefficient estimation and variable selection simultaneously: they eliminate unimportant predictors and keep only the important ones in the model. In the age of big data, datasets keep growing, and data with hundreds or even thousands of predictors are common; for such data, variable selection is all the more essential. The primary goal of this thesis is to assess the properties, strengths and weaknesses of these estimation methods, through two simulation studies and two real-data applications. The simulation results show that each method has its own characteristics, and no single method is best for all kinds of data.
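A simulation in the spirit of this comparison takes only a few lines; the sketch below contrasts the LASSO and Elastic Net on correlated predictors (LAD-LASSO, EBLASSO and EBENet have no stock scikit-learn implementation, so they are omitted), with sizes and penalty values chosen arbitrarily.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)      # strongly correlated pair
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, 3.0, -2.0]                # sparse truth
y = X @ beta + rng.normal(size=n)

for m in (Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    m.fit(X, y)
    # The Elastic Net's L2 component tends to keep both correlated
    # predictors, while the plain LASSO often picks just one of them.
    print(type(m).__name__, "keeps predictors", np.flatnonzero(m.coef_))
```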
74

Bio-statistical approaches to evaluate the link between specific nutrients and methylation patterns in a breast cancer case-control study nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) study / Approches bio-statistiques pour évaluer le lien entre nutriments et profils de méthylation du cancer du sein dans l’étude prospective Européenne sur le Cancer et la Nutrition (EPIC)

Perrier, Flavie 13 September 2018
Epigenetic datasets are challenging, being characterized by hundreds of thousands of features. The main objective of this thesis is to evaluate the performance of statistical tools developed for high-dimensional data by exploring the association between dietary factors related to breast cancer (BC) and DNA methylation within the EPIC study. To investigate the characteristics of methylation data, random and systematic sources of variability in methylation measurements were identified via the principal component partial R-square (PC-PR2) method. Using this technique, the performance of three popular normalization methods for correcting unwanted variability was evaluated by quantifying the variability attributed to laboratory factors before and after each correction. Once a suitable normalization procedure was identified, the association between alcohol intake, dietary folate and methylation levels was examined by means of three approaches: an analysis of individual CpG sites, an analysis of differentially methylated regions (DMRs), and fused lasso regression. The last two methods aim to identify specific regions of the epigenome by exploiting the potential correlation between neighbouring CpG sites. Global methylation levels were also used to investigate the relationship between methylation and BC risk. Through an exhaustive evaluation of statistical tools that reveal the complexity of DNA methylation data, this thesis provides informative insights for epigenetic studies, with promising potential for applying similar methodology to the analysis of other omics data.
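As an illustration of the region-level idea, the sketch below solves a one-dimensional fused-lasso signal approximation with cvxpy (assumed available): penalizing differences between neighbouring coefficients recovers a differentially methylated region as a piecewise-constant block. The simulated signal, threshold and penalty weight are illustrative assumptions only.

```python
import numpy as np
import cvxpy as cp

# Noisy methylation-like signal with one elevated region in the middle.
rng = np.random.default_rng(2)
truth = np.concatenate([np.zeros(40), 0.8 * np.ones(20), np.zeros(40)])
y = truth + rng.normal(0, 0.3, truth.size)

# Fused-lasso signal approximator: squared error plus an L1 penalty on
# first differences, encouraging adjacent sites to share a coefficient.
b = cp.Variable(truth.size)
lam = 2.0
objective = cp.Minimize(0.5 * cp.sum_squares(y - b) + lam * cp.norm1(cp.diff(b)))
cp.Problem(objective).solve()
print("recovered block:", np.flatnonzero(np.abs(b.value) > 0.3))
```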
75

[en] FORECASTING LARGE REALIZED COVARIANCE MATRICES: THE BENEFITS OF FACTOR MODELS AND SHRINKAGE / [pt] PREVISÃO DE MATRIZES DE COVARIÂNCIA REALIZADA DE ALTA DIMENSÃO: OS BENEFÍCIOS DE MODELOS DE FATORES E SHRINKAGE

DIEGO SIEBRA DE BRITO 19 September 2018
We propose a model to forecast very large realized covariance matrices of returns, applying it to the constituents of the S&P 500 on a daily basis. To deal with the curse of dimensionality, we decompose the return covariance matrix using standard firm-level factors (e.g., size, value, profitability) and impose sectoral restrictions on the residual covariance matrix. The restricted model is then estimated using vector heterogeneous autoregressive (VHAR) specifications fitted with the least absolute shrinkage and selection operator (LASSO). Our methodology improves forecasting precision relative to standard benchmarks and leads to better estimates of minimum-variance portfolios.
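A scalar analogue of the estimation step can be sketched as a heterogeneous autoregression fit with the LASSO, where daily, weekly and monthly averages of a simulated realized-variance series serve as predictors. The model in the thesis is vector-valued with factor and sectoral structure; this shows only the HAR-plus-LASSO core, under made-up data and penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

def har_design(x):
    """HAR predictors: yesterday's value plus 5-day and 22-day averages."""
    rows, target = [], []
    for t in range(22, len(x)):
        rows.append([x[t - 1], x[t - 5:t].mean(), x[t - 22:t].mean()])
        target.append(x[t])
    return np.array(rows), np.array(target)

rng = np.random.default_rng(3)
rv = np.abs(np.cumsum(rng.normal(0, 0.05, 500))) + 0.5   # toy realized-variance path
X, y = har_design(rv)
model = Lasso(alpha=1e-4).fit(X, y)
print("daily/weekly/monthly coefficients:", np.round(model.coef_, 3))
```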
76

On regularized estimation methods for precision and covariance matrix and statistical network inference

Kuismin, M. (Markku) 14 November 2018
Estimation of the covariance matrix is an important problem in statistics because the covariance matrix is an essential part of principal component analysis, statistical pattern recognition, multivariate regression and network exploration, to mention but a few applications. Penalized likelihood methods are used when standard estimates cannot be computed, a common situation when the number of explanatory variables is much larger than the sample size (the high-dimensional case). An alternative ridge-type estimator of the precision matrix is introduced in Article I; this estimate is derived using a penalized likelihood estimation method. Undirected networks, which are closely connected to penalized covariance and precision matrix estimation, and some network-related applications are also explored in this dissertation. In Article II, novel statistical methods are used to infer population networks from discrete measurements of genetic data. More precisely, the least absolute shrinkage and selection operator (LASSO) is applied in neighborhood selection, and the inferred network is used for more detailed inference of population structure. We illustrate how community detection can be a promising tool for exploring population structure and admixture in genetic data. In addition, Article IV shows how the precision matrix estimator introduced in Article I can be used in graphical model selection via a multiple hypothesis testing procedure. Article III contains a review of current tools for practical graphical model selection and precision/covariance matrix estimation. The other three publications contain detailed descriptions of the fundamental computational and mathematical results on which the methods presented in these articles are based. Each publication includes a collection of practical research questions where the novel methods can be applied; we hope these applications help readers better understand the possible uses of the methods presented in this dissertation.
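For a concrete handle on penalized precision-matrix estimation, the sketch below contrasts scikit-learn's graphical lasso (an L1-penalized likelihood estimator) with a naive ridge-type regularization of the sample covariance. The latter is only a loose stand-in for the Article I estimator, whose exact form is given in the dissertation; data and penalties are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=60)
S = np.cov(X, rowvar=False)

# L1-penalized likelihood: zeros in the precision matrix encode
# conditional independences, i.e. missing edges of the undirected graph.
theta_l1 = GraphicalLasso(alpha=0.2).fit(X).precision_

# Naive ridge-type alternative: always invertible, but not sparse.
theta_ridge = np.linalg.inv(S + 0.2 * np.eye(5))
print(np.round(theta_l1, 2))
```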
77

[en] FORECASTING IN HIGH-DIMENSION: INFLATION AND OTHER ECONOMIC VARIABLES / [pt] PREVISÃO EM ALTA DIMENSÃO: INFLAÇÃO E OUTRAS VARIÁVEIS ECONÔMICAS

GABRIEL FILIPE RODRIGUES VASCONCELOS 26 September 2018
This thesis consists of four articles and an R package, all focused on forecasting economic variables in high dimension. The first article shows that LASSO models are very accurate for forecasting Brazilian inflation at short horizons. The second article uses several machine learning methods to forecast a set of US macroeconomic variables; the results show that a small adaptation of the LASSO improves the forecasts, though at a high computational cost. The third article also concerns forecasting Brazilian inflation, but in real time; the main results show that a combination of machine learning models is more accurate than the specialist (FOCUS) forecasts. Finally, the last article forecasts US inflation using a very large set of models. The winning model is the random forest, which raises the question of nonlinearity in US inflation; the results show that both nonlinearity and variable selection are important for the random forest's performance.
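The linear-versus-nonlinear comparison at the heart of the fourth article can be mimicked on toy data: the sketch below runs the same autoregressive design through a LASSO and a random forest. The series, lags and hyperparameters are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
e = rng.normal(0, 0.1, 400)
x = np.zeros(400)
for t in range(1, 400):                 # nonlinear AR(1)-type series
    x[t] = 0.6 * np.tanh(2 * x[t - 1]) + e[t]

# Predict x_t from its three most recent lags.
lags = np.column_stack([x[3 - k:len(x) - 1 - k] for k in range(3)])
y = x[4:]
X_tr, X_te, y_tr, y_te = lags[:300], lags[300:], y[:300], y[300:]

for m in (Lasso(alpha=1e-3), RandomForestRegressor(n_estimators=200, random_state=0)):
    m.fit(X_tr, y_tr)
    mse = np.mean((m.predict(X_te) - y_te) ** 2)
    print(type(m).__name__, "test MSE:", round(mse, 5))
```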
78

Fine mapping and single nucleotide polymorphism effects estimation on pig chromosomes 1, 4, 7, 8, 17 and X / Mapeamento fino e estimação dos efeitos de polimorfismos de base única nos cromossomos suínos 1, 4, 7, 8, 17 e X

Hidalgo, André Marubayashi 08 July 2011
Quantitative trait loci (QTL) mapping efforts often result in the detection of genomic regions that explain part of the quantitative trait variation. These regions are, however, very large and do not allow accurate gene identification, so the intervals where the QTL are located must be narrowed. With genome-wide selection (GWS), statistical tools have been developed to estimate the effect of each marker, and from these effect values one can determine which markers have large effects. The objective of this investigation was therefore to fine-map pig chromosomes 1, 4, 7, 8, 17 and X, using microsatellite and single nucleotide polymorphism (SNP) markers, in an F2 population produced by crossing boars of the naturalized Brazilian Piau breed with commercial females, for performance, carcass, internal organ, cut yield and meat quality traits. A further aim was to estimate the effects of the SNP markers on the traits with detected QTL, identify the most expressive markers, and verify whether the markers with larger effects lie within the QTL confidence intervals. QTL were identified by regression interval mapping using the GridQTL software, and individual marker effects were estimated by Bayesian LASSO regression using the R software. In total, 32 QTL were significant at the 5% chromosome-wide level, including 12 significant at the 1% chromosome-wide level, of which 7 were significant at the 5% genome-wide level. Six of the seven QTL with genome-wide significance had markers of large effect within their confidence intervals. These results confirm some previously reported QTL and identify numerous novel QTL for the investigated traits. They also show that combining microsatellite and SNP markers increases genome saturation and leads to QTL with smaller confidence intervals. The methods used were valuable both for estimating the marker effects and for locating the most expressive markers within the QTL confidence intervals, validating the QTL found by the regression method.
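To connect the marker-effect step to code: the Bayesian LASSO places independent Laplace priors on SNP effects, and its posterior mode coincides with the classical L1-penalized estimate, so the sketch below uses a plain scikit-learn Lasso as a cheap stand-in for the Gibbs-sampled estimates used in the thesis. The genotype coding, sizes and penalty are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 300, 1000                                     # more SNPs than animals
X = rng.integers(0, 3, size=(n, p)).astype(float)    # genotypes coded 0/1/2
beta = np.zeros(p)
beta[[10, 250, 700]] = [0.8, -0.5, 0.6]              # three hypothetical QTL
y = X @ beta + rng.normal(0, 1.0, n)

# MAP analogue of the Bayesian LASSO: L1-penalized marker effects.
effects = Lasso(alpha=0.05).fit(X, y).coef_
print("largest-effect markers:", np.argsort(-np.abs(effects))[:5])
```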
79

Využití technik genetických algoritmů a dolování z dat v testování paralelních programů s využitím vkládání šumu / Application of Genetic Algorithms and Data Mining in Noise-based Testing of Concurrent Software

Šimková, Hana Unknown Date
This thesis proposes improving the performance of program testing by using data mining and genetic algorithm techniques in the testing of concurrent programs. Concurrent programming has become very popular in recent years, even though it is much more demanding than simpler sequential programming, and its increased use therefore leads to a substantially higher number of errors. These errors arise from mistakes in the synchronization of the individual processes of a program. Finding such errors in the traditional way is difficult, and repeatedly running the tests in the same environment typically leads only to the same interleavings being explored. The thesis uses the noise injection method, which stresses the program so that new behaviours may appear. For this method to be effective, suitable heuristics and values of their parameters must be chosen, which is not easy. The thesis applies data mining methods, genetic algorithms and their combination to find these heuristics and parameter values. Besides the research results, the thesis also gives a brief overview of other techniques for testing concurrent programs.
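A toy version of the search over noise heuristics might look like the genetic-algorithm loop below, where an individual is a (heuristic id, noise frequency, noise strength) triple. The fitness function here is a placeholder: in the thesis it would aggregate coverage and error-detection metrics gathered from repeated noisy test runs.

```python
import random

def fitness(cfg):
    # Placeholder objective standing in for test-run metrics; it simply
    # rewards configurations near an arbitrary "good" setting.
    heuristic, freq, strength = cfg
    return -((freq - 0.3) ** 2 + (strength - 0.7) ** 2) + 0.01 * heuristic

def mutate(cfg):
    # Perturb frequency and strength, occasionally switching heuristic.
    h, f, s = cfg
    return (random.randint(0, 3),
            min(1.0, max(0.0, f + random.gauss(0, 0.1))),
            min(1.0, max(0.0, s + random.gauss(0, 0.1))))

pop = [(random.randint(0, 3), random.random(), random.random()) for _ in range(20)]
for gen in range(30):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                       # elitist selection
    pop = survivors + [mutate(random.choice(survivors)) for _ in range(10)]
print("best noise configuration:", max(pop, key=fitness))
```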
80

[en] FORECASTING INDUSTRIAL PRODUCTION IN BRAZIL USING MANY PREDICTORS / [pt] PREVENDO A PRODUÇÃO INDUSTRIAL BRASILEIRA USANDO MUITOS PREDITORES

LEONARDO DE PAOLI CARDOSO DE CASTRO 23 December 2016
In this article we compared the forecasting accuracy of unrestricted and penalized regressions using many predictors for the Brazilian industrial production index. We focused on the least absolute shrinkage and selection operator (LASSO) and its extensions. We also proposed a combination of penalized regressions with a variable search algorithm (PVSA). Factor-based models were used as our benchmark specification. Our study produced three main findings. First, LASSO-based models outperformed the benchmark in short-term forecasts. Second, the PVSA outperformed the benchmark regardless of the horizon. Finally, the best predictive variables were consistently chosen by all methods considered. As expected, these variables are closely related to Brazilian industrial activity; examples include vehicle production and cardboard production.
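One way to picture a shrinkage-plus-search combination of this kind is the sketch below: the LASSO shortlists candidate predictors, and an exhaustive search over small subsets of the shortlist picks a final model under a crude complexity penalty. The actual PVSA differs in its details; everything here is an illustrative assumption.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 40))
y = X[:, 3] - 0.5 * X[:, 17] + rng.normal(0, 0.5, 120)

# Stage 1: LASSO shortlist of candidate predictors.
shortlist = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)[:8]

# Stage 2: exhaustive search over small subsets of the shortlist.
def subset_score(cols):
    cols = list(cols)
    r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
    return r2 - 0.02 * len(cols)      # crude penalty so R^2 can't just grow

best = max((c for r in (1, 2, 3) for c in combinations(shortlist, r)),
           key=subset_score)
print("selected predictors:", list(best))
```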
