11 |
Dataselektering en –manipulering vir statistiese Engels–Afrikaanse masjienvertaling / McKellar, Cindy A. January 2011.
The success of any machine translation system depends largely on the quantity and quality of the available training data. A system trained on faulty or low-quality data will naturally produce poorer output than a system trained on correct or high-quality data. For resource-scarce languages, where little data is available and data may have to be translated specifically to create the parallel corpora that serve as training data, it is very important that the data chosen for translation includes the text segments that will add the most value to the machine translation system. In such a case it is also extremely important to use the available data as effectively as possible.
This study investigates methods for selecting training data with the aim of training an optimal machine translation system with limited resources. Attention is also given to the possibility of increasing the weights of certain portions of the training data so as to emphasise the data that adds the most value to the machine translation system. Although this study focuses specifically on data selection and manipulation methods for the English–Afrikaans language pair, the methods could also be applied to other language pairs.
The evaluation process indicates that both the data selection methods and the adjustment of data weights have a positive impact on the quality of the resulting machine translation system. The final system, trained with a combination of the different methods, shows an increase of 2.0001 in the NIST score and an increase of 0.2039 in the BLEU score. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2011.
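The abstract does not specify which selection criteria were used, so the following is only an illustrative sketch of one common approach: greedily choosing the sentences that contribute the most previously unseen n-grams, a standard proxy for the value a segment adds to a translation model. All names and the toy corpus are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def select_for_translation(sentences, budget, n=2):
    """Greedily pick `budget` sentences that maximize coverage of new n-grams."""
    seen, chosen = set(), []
    pool = [(s, ngrams(s.split(), n)) for s in sentences]
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda p: len(p[1] - seen))  # most unseen n-grams
        pool.remove(best)
        seen |= best[1]
        chosen.append(best[0])
    return chosen

corpus = ["the cat sat on the mat",
          "the dog sat on the mat",
          "a completely different sentence"]
print(select_for_translation(corpus, budget=2))
```

A weighting scheme of the kind the abstract mentions could then be approximated by duplicating or up-weighting the selected segments in the training corpus.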
|
13 |
Optimal Active Learning: experimental factors and membership query learning / Yu-hui Yeh. Unknown date.
The field of Machine Learning is concerned with the development of algorithms, models and techniques that solve challenging computational problems by learning from data representative of the problem (e.g. given a set of medical images previously classified by a human expert, build a model to predict whether unseen images are benign or malignant). Many important real-world problems have been formulated as supervised learning problems, under the assumption that a data set is available containing the correct output (e.g. class label or target value) for each given data point. In many application domains, obtaining the correct outputs (labels) for data points is a costly and time-consuming task. This has motivated the development of Machine Learning techniques that attempt to minimize the number of labeled data points while maintaining good generalization performance on a given problem. Active Learning is one such class of techniques and is the focus of this thesis. Active Learning algorithms select or generate unlabeled data points to be labeled and use these points for learning. If successful, an Active Learning algorithm should be able to produce learning performance (e.g. test set error) comparable to an equivalent supervised learner while using fewer labeled data points. Theoretical, algorithmic and experimental Active Learning research has been conducted and a number of successful applications have been demonstrated. However, the scope of many experimental studies on Active Learning has been relatively small, and there are very few large-scale experimental evaluations of Active Learning techniques. A significant amount of performance variability exists across Active Learning experimental results in the literature. Furthermore, the implementation details and effects of experimental factors have not been closely examined in empirical Active Learning research, casting some doubt on the strength and generality of conclusions that can be drawn from such results. The Active Learning model/system used in this thesis is the Optimal Active Learning (OAL) algorithm framework with Gaussian Processes for regression problems (however, most of the research questions are of general interest in many other Active Learning scenarios). Experimental and implementation details of the Active Learning system are described in detail, using a number of regression problems and datasets of different types. It is shown that the experimental results of the system are subject to significant variability across problem datasets. The hypothesis that experimental factors can account for this variability is then investigated. The results show the impact of sampling and of the sizes of the datasets used when generating experimental results. Furthermore, preliminary experimental results expose performance variability across various real-world regression problems. The results suggest that these experimental factors can, to a large extent, account for the variability observed in experimental results. A novel resampling technique for Optimal Active Learning, called '3-Sets Cross-Validation', is proposed as a practical solution to reduce experimental performance variability, and further results confirm the usefulness of the technique. The thesis then extends the Optimal Active Learning framework to perform learning via membership queries, using a novel algorithm named MQOAL. The MQOAL algorithm employs the Metropolis-Hastings Markov chain Monte Carlo (MCMC) method to sample data points for query selection.
Experimental results show that MQOAL provides comparable performance to the pool-based OAL learner, using a very generic, simple MCMC technique, and is robust to experimental factors related to the MCMC implementation. The possibility of making queries in batches is also explored experimentally, with results showing that while some performance degradation does occur, it is minimal for learning in small batch sizes, which is likely to be valuable in some real-world problem domains.
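The abstract does not give the MQOAL utility function or proposal distribution; the sketch below only illustrates the general mechanism it names, Metropolis-Hastings sampling of query locations in proportion to an assumed informativeness score. The `utility` function here is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility(x):
    # Hypothetical unnormalized "informativeness" of a membership query at x,
    # standing in for e.g. Gaussian Process predictive variance.
    return np.exp(-0.5 * np.sin(3 * x) ** 2) * np.exp(-0.5 * (x / 3) ** 2)

def mh_queries(n_samples, step=0.5, x0=0.0):
    """Sample query locations with probability proportional to utility(x)."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.normal(scale=step)          # symmetric random walk
        if rng.uniform() < min(1.0, utility(proposal) / utility(x)):
            x = proposal                               # accept the move
        samples.append(x)
    return np.array(samples)

print(mh_queries(1000)[::200])  # a few generated query locations
```

Because the acceptance ratio only needs the utility up to a constant, this works even when the informativeness score cannot be normalized, which is one reason MCMC is attractive for membership query generation.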
|
15 |
Redes Bayesianas aplicadas a estimação da taxa de prêmio de seguro agrícola de produtividade / Bayesian networks applied to estimation of yield insurance premium / Polo, Lucas. 08 July 2016.
Information characterizing the risk of crop losses is necessary for crop and revenue insurance underwriting. The probability distribution of yield is one such piece of information, in particular the distribution of yield conditioned on climatic risk factors. This research applies Bayesian networks (directed acyclic graphs, or hierarchical Bayesian models) to estimate the probability distribution of soybean yield for some counties in Paraná state (Brazil), with a focus on comparative risk analysis. Meteorological data (ANA and INMET, 1970 to 2011) and remote sensing data (MODIS, 2000 to 2011) were used jointly to describe the climate risk of production loss spatially.
The yield data used in this study (COAMO, 2001 to 2011) required grouping to the county level; for that, data selection was performed in the spatial and temporal dimensions using a soybean crop map (estimated by a support vector machine, SVM) and the results of a crop cycle identification algorithm. The interpolation of temperature data used a trend component estimated from remote sensing data, to describe spatial variations of the variable that are obscured by traditional interpolation methods. As results, a significant relation between temperature from meteorological stations and remote sensing data was found, supporting their joint use in the estimates. The classifier that estimates the soybean crop map shows overfitting for the crop seasons from which the training samples were collected. Besides the data selection, the cycle identification also yielded distributions of soybean seeding dates for Paraná state. Bayesian networks show great potential and some advantages when applied to agricultural risk modelling. Representing the probability distribution as a graph aids the understanding of complex problems through causality assumptions, and simplifies the fitting, structuring and application of the probabilistic model. The log-normal distribution proved the most adequate for modelling the environment variables (thermal sum, accumulated rainfall and longest period without rain), and the beta distribution for relative yield and state indices (NDVI and EVI ranges). In the beta regression, the precision parameter was also modelled as dependent on the explanatory variables, improving the distribution fit. Overall, the probabilistic model had low representativity, considerably underestimating premium rates relative to market rates, but it still contributes to the comparative understanding of production loss risk scenarios for the soybean crop.
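As a worked illustration of the final step described above, a pure premium rate can be approximated by Monte Carlo as the expected yield shortfall below the coverage level, once a distribution for relative yield has been fitted (the study uses a beta distribution). The parameter values below are invented for illustration, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

def premium_rate(alpha, beta, coverage=0.7, n=200_000):
    """E[max(0, coverage - Y)] / coverage for relative yield Y ~ Beta(alpha, beta)."""
    y = rng.beta(alpha, beta, size=n)
    return np.maximum(0.0, coverage - y).mean() / coverage

# Two hypothetical counties: the more dispersed yield gets the higher rate.
print(f"low-risk county:  {premium_rate(8.0, 2.0):.4f}")
print(f"high-risk county: {premium_rate(3.0, 1.5):.4f}")
```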
|
16 |
Traffic data sampling for air pollution estimation at different urban scales / Échantillonnage des données de trafic pour l’estimation de la pollution atmosphérique aux différentes échelles urbaines / Schiper, Nicole. 09 October 2017.
Road traffic is a major source of air pollution in urban areas. Policy makers are pushing for different solutions, including new traffic management strategies that can directly lower pollutant emissions. To assess the performance of such strategies, the calculation of pollutant emissions should account for the spatial and temporal dynamics of traffic. The use of traditional on-road sensors (e.g. inductive loop detectors) for collecting real-time data is necessary but not sufficient because of their high implementation cost. A further disadvantage is that, for practical reasons, such technologies only provide local information. Methods are therefore needed to extend this local information to a large spatial extent. These methods currently suffer from two limitations: (i) the relationship between missing data and estimation accuracy cannot be easily determined, and (ii) calculations over a large area are computationally expensive, in particular when time evolution is considered.
Given a dynamic traffic simulation coupled with an emission model, a novel approach to this problem is taken by applying selection techniques that can identify the most relevant locations for estimating the network vehicle emissions at various spatial and temporal scales. This work explores the use of different statistical methods, both naïve and smart, as tools for selecting the most relevant traffic and emission information on a network in order to determine total values at any scale. It also highlights some cautions that apply when such a coupled traffic-emission method is used to quantify emissions due to traffic. Using the COPERT IV emission functions at various spatial-temporal scales induces a bias that depends on traffic conditions, in comparison to the original scale (driving cycles). This bias, observed in our simulations, has been quantified as a function of traffic indicators (mean speed) and shown to have a double origin: the convexity of the emission functions and the covariance of the traffic variables.
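The convexity part of that double origin can be illustrated directly. Emission-versus-speed curves are typically convex, so by Jensen's inequality, evaluating the curve at an aggregated mean speed underestimates the mean of per-vehicle emissions. The curve below is a generic convex stand-in, not the actual COPERT IV function.

```python
import numpy as np

rng = np.random.default_rng(1)

def emission_factor(v):
    """Hypothetical convex U-shaped emission factor (g/km) vs speed (km/h)."""
    return 500.0 / v + 0.05 * v

speeds = rng.uniform(10, 90, size=10_000)    # heterogeneous vehicle speeds
coarse = emission_factor(speeds.mean())      # curve applied to the mean speed
fine = emission_factor(speeds).mean()        # mean of per-vehicle factors
print(f"coarse-scale estimate: {coarse:.2f} g/km")
print(f"fine-scale estimate:   {fine:.2f} g/km (larger, by Jensen's inequality)")
```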
|
17 |
A General System for Supervised Biomedical Image Segmentation / Chen, Cheng. 15 March 2013.
Image segmentation is important, with applications to several problems in biology and medicine. While extensively researched, current segmentation methods generally perform adequately in the applications for which they were designed, but often require extensive modification or recalibration before they can be used in a different application. We describe a system that, with few modifications, can be used in a variety of image segmentation problems. The system is based on a supervised learning strategy that uses intensity neighborhoods to assign each pixel in a test image its correct class based on training data. In summary, we offer several innovations: (1) a general framework for such a system is proposed, in which rotations and scale variations of intensity neighborhoods are modeled, and a multi-scale classification framework is used to segment unknown images; (2) a fast algorithm for training data selection and pixel classification is presented, in which a majority-voting criterion is proposed for selecting a small subset from the raw training set; combined with a 1-nearest-neighbor (1-NN) classifier, this algorithm provides decent classification accuracy at reasonable computational cost; (3) a general deformable model for optimizing segmented regions is proposed, which takes the decision values from the preceding pixel classification as input and optimizes the segmented regions in a partial differential equation (PDE) framework. We show that the performance of this system in several different biomedical applications, such as tissue segmentation in magnetic resonance and histopathology microscopy images, as well as nuclei segmentation in fluorescence microscopy images, is similar to or better than that of several algorithms specifically designed for each of these applications.
In addition, we describe another general segmentation system for biomedical applications where a strong prior on shape is available (e.g. cells, nuclei). The idea is based on template matching and supervised learning, and we show examples of segmenting cells and nuclei from microscopy images. The method uses examples selected by a user to build a statistical model that captures the texture and shape variations of the nuclear structures in a given data set to be segmented. Segmentation of subsequent, unlabeled images is then performed by finding the model instance that best matches (in the normalized cross-correlation sense) the local neighborhood in the input image. We demonstrate the application of our method to segmenting cells and nuclei from a variety of imaging modalities, and quantitatively compare our results to several other methods. Quantitative results using both simulated and real image data show that, while certain methods may work well for certain imaging modalities, our software obtains high accuracy across all the imaging modalities studied. The results also demonstrate that, relative to several existing methods, the proposed template-based method is more robust in the sense of better handling variations in illumination and in texture across imaging modalities, producing smoother and more accurate segmentation borders, and better handling cluttered cells and nuclei.
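The matching step of the template-based system can be sketched as follows, assuming a single learned template; the real system builds a statistical model with texture and shape variations, so this is a deliberately minimal illustration.

```python
import numpy as np

def ncc_match(image, template):
    """Top-left corner of the window maximizing normalized cross-correlation."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-9)
    best_score, best_pos = -np.inf, (0, 0)
    for i in range(image.shape[0] - th + 1):
        for j in range(image.shape[1] - tw + 1):
            w = image[i:i + th, j:j + tw]
            w = (w - w.mean()) / (w.std() + 1e-9)
            score = (w * t).mean()           # NCC score in [-1, 1]
            if score > best_score:
                best_score, best_pos = score, (i, j)
    return best_pos, best_score

rng = np.random.default_rng(7)
img = rng.normal(size=(64, 64))
tmpl = img[20:28, 30:38].copy()              # plant a known patch as "nucleus"
print(ncc_match(img, tmpl))                  # recovers (20, 30) with score near 1
```

Normalizing each window by its own mean and standard deviation is what gives this kind of matcher the robustness to illumination changes noted above.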
|
19 |
[en] DATA SELECTION FOR LVQ / [pt] SELEÇÃO DE DADOS EM LVQ / Peres, Rodrigo Tosta. 20 September 2004.
In this dissertation, we propose a methodology for data selection in Learning Vector Quantization models, widely referenced in the literature by the acronym LVQ. Training a model (in-sample fitting) with a subset selected from the data available for learning can bring great benefits to the generalization (out-of-sample) result. In this sense, it is very important to perform a search that selects data which, besides being representative of their original distributions, are not noise (in the sense defined throughout this dissertation). The proposed method seeks the relevant points of the input set based on the correlation between the error of each point and the error of the rest of the distribution. In general, the aim is to eliminate a considerable part of the noise while keeping the points that are relevant for fitting the model (learning). Thus, specifically in LVQ, the prototypes are updated during learning using a subset of the originally available training set. Numerical experiments were carried out with simulated and real data, and the results obtained were very interesting, clearly showing the potential of the proposed method.
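A minimal sketch of the idea, under simplified assumptions: standard LVQ1 prototype updates are run only on a selected subset of the training set. The neighbor-majority filter below is a stand-in for the dissertation's error-correlation criterion, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def lvq1(X, y, prototypes, proto_labels, lr=0.05, epochs=20):
    """LVQ1: attract the nearest prototype on a correct label, repel otherwise."""
    P = prototypes.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            k = np.argmin(np.linalg.norm(P - xi, axis=1))
            sign = 1.0 if proto_labels[k] == yi else -1.0
            P[k] += sign * lr * (xi - P[k])
    return P

# Toy two-class data with injected label noise.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
y[rng.choice(200, 10, replace=False)] ^= 1

# Simplified selection: keep points agreeing with the majority of 5 neighbors.
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
keep = np.array([np.bincount(y[np.argsort(D[i])[1:6]]).argmax() == y[i]
                 for i in range(len(X))])

protos = lvq1(X[keep], y[keep], X[[0, 150]].copy(), y[[0, 150]])
print(f"kept {keep.sum()} of {len(X)} points; prototypes:\n{protos}")
```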
|
20 |
Volba a optimalizace řezných podmínek pro progresivní výrobní technologii zalomeného hřídele / Data selection and optimisation of cutting conditions for progressive production technology of the crank shaft / Sonberger, Vít. January 2015.
This thesis focuses on the proposal of a production process for an assembled crankshaft. It covers the choice of tools and the selection and optimisation of cutting data for the manufacture of the individual components and for the assembly. Important parameters for pressing the components together are also calculated. The selected cutting conditions are experimentally verified.
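The abstract does not state which pressing parameters are calculated or how; a typical calculation for an assembled (pressed) crankshaft joint is the interference-fit contact pressure and the resulting axial pressing force. The sketch below uses the standard Lamé thick-cylinder result for a solid pin in a hub of the same material, with all numeric values invented for illustration.

```python
import math

E = 210e9       # Young's modulus of steel, Pa (same material assumed for pin and web)
d = 0.050       # joint diameter, m
D = 0.090       # outer diameter of the crank web around the joint, m
L = 0.040       # joint length, m
delta = 60e-6   # diametral interference, m
mu = 0.12       # assumed friction coefficient for pressing

# Contact pressure for a solid shaft pressed into a same-material hub (Lame).
p = E * delta * (D**2 - d**2) / (2 * d * D**2)
F = mu * p * math.pi * d * L    # axial force required to press the joint
print(f"contact pressure: {p / 1e6:.1f} MPa, pressing force: {F / 1e3:.1f} kN")
```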
|