201

An Investigation of Unidimensional Testing Procedures under Latent Trait Theory using Principal Component Analysis

McGill, Michael T. 11 December 2009 (has links)
There are several generally accepted rules for detecting unidimensionality, but none are well tested. This simulation study investigated well-known methods, including but not limited to the Kaiser (k>1) Criterion, Percentage of Measure Validity (greater than 50%, 40%, or 20%), Ratio of Eigenvalues, and the Kelley method, and compared these methods to each other and to a new method proposed by the author (the McGill method) for assessing unidimensionality. By applying principal component analysis (PCA) to the residuals of a Latent Trait Test Theory (LTTT) model, this study addressed three purposes: determining the Type I error rates associated with various criterion values for assessing unidimensionality; determining the Type II error rates and statistical power associated with various rules of thumb when assessing dimensionality; and, finally, determining whether more suitable criterion values could be established for the methods of the study by accounting for various characteristics of the measurement context. For those methods based on criterion values, new modified values are proposed. For those methods without criterion values for dimensionality decisions, criterion values are modeled and presented. The methods compared in this study were investigated using PCA on residuals from the Rasch model. The sample size, test length, ability distribution variability, and item distribution variability were varied, and the resulting Type I and Type II error rates of each method were examined. The results imply that certain conditions can cause improper diagnoses of the dimensionality of instruments. Adjusted methods are suggested to induce more stable Type I and Type II error rates. The nearly ubiquitous Kaiser method was found to be biased towards signaling multidimensionality whether or not it exists. The modified version of the Kaiser method and the McGill method proposed by the author were shown to be among the best at detecting unidimensionality when it was present. In short, methods that take into account changes in variables such as sample size, test length, item variability, and person variability are better than methods that use a single, static criterion value in decision making with respect to dimensionality. / Ph. D.
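To make the decision rules concrete, the following minimal sketch (Python; placeholder data, with the Rasch model fit itself omitted) applies the Kaiser and eigenvalue-ratio checks to a PCA of a residual matrix:

```python
import numpy as np

# Sketch of eigenvalue-based unidimensionality checks on the residuals of a
# fitted Rasch model. `residuals` is a hypothetical persons-by-items matrix
# of standardized residuals (observed minus model-expected responses).
rng = np.random.default_rng(0)
residuals = rng.standard_normal((500, 30))  # placeholder data

# PCA of the residual correlation matrix.
corr = np.corrcoef(residuals, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted descending

# Kaiser criterion: components with eigenvalue > 1 suggest residual structure.
n_kaiser = int(np.sum(eigenvalues > 1.0))

# Ratio of eigenvalues: a large first-to-second ratio flags a dominant
# residual dimension (any threshold here is illustrative, not from the study).
ratio = eigenvalues[0] / eigenvalues[1]

print(f"Kaiser components (eigenvalue > 1): {n_kaiser}")
print(f"First/second eigenvalue ratio: {ratio:.2f}")
```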
202

Effects of Manufacturing Deviations on Core Compressor Blade Performance

De Losier, Clayton Ray 20 April 2009 (has links)
There has been recent incentive for understanding the possible deleterious effects that manufacturing deviations can have on compressor blade performance. This is of particular importance today, as compressor designs push operating limits by employing fewer stages with higher loadings and are required to operate at ever higher altitudes. Deviations in these advanced designs, as well as in legacy designs, could negatively affect the performance and operation of a core compressor; thus, a numerical investigation to quantify manufacturing deviations and their effects is undertaken. Data from three radial sections of every compressor blade in a single row of a production compressor are used as the basis for this investigation. Deviations from the compressor blade design intent to the as-manufactured blades are quantified with a statistical method known as principal component analysis (PCA). MISES, an Euler solver coupled with integral boundary-layer calculations, is used to analyze the effects that the aforementioned deviations have on compressor blade performance when the inlet flow conditions produce a Mach number of approximately 0.7 and a Reynolds number of approximately 6.5e5. It was found that the majority of manufacturing deviations were within a range of plus or minus 4 percent of the design intent, and deviations at the leading edge had a critical effect on performance. Of particular interest is the fact that deviations at the leading edge not only degraded performance but significantly changed the boundary-layer behavior from that of the design case. / Master of Science
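As an illustration of the deviation-quantification step, the sketch below (Python, with placeholder blade-section data standing in for the measured profiles) extracts dominant deviation modes by applying PCA to as-manufactured-minus-design differences:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of quantifying manufacturing deviations with PCA. `measured` is a
# hypothetical (n_blades, n_points) array of blade-section coordinates and
# `nominal` the design-intent section; real inputs would come from
# coordinate measurements of each blade in the row.
rng = np.random.default_rng(1)
nominal = np.sin(np.linspace(0, np.pi, 60))            # placeholder profile
measured = nominal + 0.01 * rng.standard_normal((80, 60))

deviations = measured - nominal                        # as-manufactured minus design
pca = PCA(n_components=5)
scores = pca.fit_transform(deviations)

# Each principal component is a spatial "deviation mode"; the scores say how
# strongly each blade expresses that mode (e.g., a leading-edge error mode).
print(pca.explained_variance_ratio_)
```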
203

Segmentation of the market for labeled ornamental plants by environmental preferences: A latent class analysis

D'Alessio, Nicole Marie 09 July 2015 (has links)
Labeling is a product differentiation mechanism that has increased in prevalence across many markets. This study investigated the potential for a labeling program applied to ornamental plant sales, given two key ongoing issues affecting ornamental plant producers: irrigation water use and plant disease. Our research investigated how to better understand the market for plants certified as disease-free and/or produced using water conservation techniques by segmenting the market according to consumers' environmental preferences. Latent class analysis was conducted using choice modeling survey results and respondent scores on the New Environmental Paradigm scale. The results show that when accounting for environmental preferences, consumers can be grouped into two market segments which, relative to each other, can be characterized as price sensitive and attribute sensitive. Our research also investigated the market segments' preferences for multiple certifying authorities. The results strongly suggest that consumers in either segment do not have a preference for any particular certifying authority. / Master of Science
204

Modified Kernel Principal Component Analysis and Autoencoder Approaches to Unsupervised Anomaly Detection

Merrill, Nicholas Swede 01 June 2020 (has links)
Unsupervised anomaly detection is the task of identifying examples that differ from the normal or expected pattern without the use of labeled training data. Our research addresses shortcomings in two existing anomaly detection algorithms, Kernel Principal Component Analysis (KPCA) and Autoencoders (AE), and proposes novel solutions to improve both of their performances in the unsupervised setting. Anomaly detection has several useful applications, such as intrusion detection, fault monitoring, and vision processing. More specifically, anomaly detection can be used in autonomous driving to identify obscured signage or to monitor intersections. Kernel techniques are desirable because of their ability to model highly non-linear patterns, but they are limited in the unsupervised setting due to their sensitivity to parameter choices and the absence of a validation step. Additionally, conventional KPCA suffers from quadratic time and memory complexity in the construction of the Gram matrix and cubic time complexity in its eigendecomposition. The problem of tuning the Gaussian kernel parameter, σ, is solved using mini-batch stochastic gradient descent (SGD) optimization of a loss function that maximizes the dispersion of the kernel matrix entries. Secondly, the computational time is greatly reduced, while still maintaining high accuracy, by using an ensemble of small "skeleton" models and combining their scores. The performance of traditional machine learning approaches to anomaly detection plateaus as the volume and complexity of data increase. Deep anomaly detection (DAD) involves the application of multilayer artificial neural networks to identify anomalous examples. AEs are fundamental to most DAD approaches. Conventional AEs rely on the assumption that a trained network will learn to reconstruct normal examples better than anomalous ones. In practice, however, given sufficient capacity and training time, an AE will generalize to reconstruct even very rare examples. Three methods are introduced to more reliably train AEs for unsupervised anomaly detection: Cumulative Error Scoring (CES) leverages the entire history of training errors to minimize the importance of early stopping; Percentile Loss (PL) training aims to prevent anomalous examples from contributing to parameter updates; and, lastly, early stopping via knee detection aims to limit the risk of overtraining. Ultimately, the two modified methods proposed in this research, Unsupervised Ensemble KPCA (UE-KPCA) and the Modified Training and Scoring AE (MTS-AE), demonstrate improved detection performance and reliability compared to many baseline algorithms across a number of benchmark datasets. / Master of Science / Anomaly detection is the task of identifying examples that differ from the normal or expected pattern. The challenge of unsupervised anomaly detection is distinguishing normal and anomalous data without the use of labeled examples to demonstrate their differences. This thesis addresses shortcomings in two anomaly detection algorithms, Kernel Principal Component Analysis (KPCA) and Autoencoders (AE), and proposes new solutions to apply them in the unsupervised setting. Ultimately, the two modified methods, Unsupervised Ensemble KPCA (UE-KPCA) and the Modified Training and Scoring AE (MTS-AE), demonstrate improved detection performance and reliability compared to many baseline algorithms across a number of benchmark datasets.
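The kernel-entry dispersion objective is straightforward to illustrate. The sketch below uses placeholder data and, for brevity, replaces the thesis's mini-batch SGD with a simple scan over candidate σ values; only the dispersion criterion itself is taken from the abstract:

```python
import numpy as np

# Choose the Gaussian kernel width so that kernel matrix entries are
# maximally spread out (neither all near 0 nor all near 1).
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))  # placeholder unlabeled data

# Pairwise squared Euclidean distances.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

def dispersion(sigma):
    K = np.exp(-sq_dists / (2 * sigma**2))      # Gaussian (RBF) kernel matrix
    off_diag = K[~np.eye(len(K), dtype=bool)]   # ignore the trivial diagonal
    return off_diag.var()

sigmas = np.logspace(-2, 2, 50)
best = max(sigmas, key=dispersion)
print(f"sigma maximizing kernel-entry dispersion: {best:.3f}")
```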
205

Applications of Sensory Analysis for Water Quality Assessment

Byrd, Julia Frances 30 January 2018 (has links)
In recent years, communities that source raw water from the Dan River experienced two severe and unprecedented outbreaks of unpleasant tastes and odors (T&O) in their drinking water. During both T&O events strong 'earthy', 'musty' odors were reported, but the source was not identified. The first T&O event began in early February 2015 and coincided with an algal bloom in the Dan River. The algal bloom was thought to be the cause, but after the bloom dissipated, odors persisted until May 2015. The second T&O event, in October 2015, did not coincide with observed algal blooms. On February 2, 2014, approximately 39,000 tons of coal ash from a Duke Energy coal ash pond was spilled into the Dan River near Eden, NC. As there were no documented T&O events before the spill, there is concern the coal ash adversely impacted water quality and biological communities in the Dan River, leading to the T&O events. In addition to the coal ash spill, years of industrial and agricultural activity in the Dan River area may have contributed to the T&O events. The purpose of this research was to elucidate causes of the two T&O events and provide guidance to prevent future problems. Monthly water samples were collected from August 2016 to September 2017 from twelve sites along the Dan and Smith Rivers. Multivariate analyses were applied to look for underlying factors and spatial or temporal trends in the data. There were no reported T&O events during the project, but sensory analysis (Flavor Profile Analysis) characterized the earthy/musty odors present. No temporal or spatial trends of odors were observed. Seven earthy/musty odorants commonly associated with T&O events were detected. Odor intensity was mainly driven by geosmin, but no relationship between strong odors and odorants was observed. / Master of Science / In recent years, communities that source water from the Dan River experienced two severe and unprecedented outbreaks of unpleasant tastes and odors (T&O) in their drinking water. During both odor events strong 'earthy', 'musty' odors were reported, but the source was not identified. The first event began in early February 2015 and coincided with an algal bloom in the Dan River. The algal bloom was thought to be the cause, but after the bloom dissipated, odors persisted until May 2015. The odors returned in October 2015 but did not coincide with an algal bloom. On February 2, 2014, approximately 39,000 tons of coal ash from a Duke Energy coal ash pond was spilled into the Dan River near Eden, NC. As no documented odor events occurred before the spill, there is concern the coal ash adversely impacted the water quality in the Dan River, leading to the odor events. The purpose of this research was to elucidate causes of the two odor events and provide guidance to prevent future problems. Monthly water samples were collected from August 2016 to September 2017 from twelve sites along the Dan and Smith Rivers. Multivariate analyses were applied to look for important factors. There were no reported odor events during the project, but sensory analysis characterized the earthy/musty odors present. No temporal or spatial trends of odors were observed. Seven earthy/musty odorants commonly associated with odor events were detected.
206

Principal component analysis of gasoline DART-MS data for forensic source attribution

Vanderfeen, Allison M. 14 November 2024 (has links)
Rapid and reliable techniques are necessary for the analysis of accelerants, including gasoline, from fire debris evidence in forensic arson investigations. Gasoline additives can be used as chemical attribute signatures (CAS) to distinguish between source locations due to the variation in additives used. Source attribution using CAS is needed in forensic chemistry, as the determination of a single gasoline source could be a potential investigative tool for law enforcement and other agencies conducting arson investigations. Direct analysis in real time-mass spectrometry (DART-MS) has gained popularity in the field of forensic chemistry for chemical analysis, and it has been applied to fire debris analysis. DART-MS has great capacity for gasoline source attribution due to its ionization technique and inclusion of higher molecular weight ions, which correspond to the CAS in gasoline. To test the hypothesis of gasoline source attribution, 21 gasoline samples were collected across Massachusetts, New Hampshire, and Connecticut. DART-MS data were generated for each sample of gasoline in replicates of 10. The data were grouped based on geographical location and evaluated by Principal Component Analysis (PCA). PCA was used to evaluate the similarities and differences in gasoline DART-MS data by generating and classifying the gasoline sample groups formed. Leave-one-out cross-validation (LOOCV) was performed on each geographical group after PCA. LOOCV was used as the validation technique to determine the validity of the model and assess its capability at classifying unknown gasoline samples. DART-MS data across geographical groups were found to have varying levels of similarity and difference through visual inspection of the mass spectra. PCA showed distinct groupings of individual gasoline samples across all tested geographical groups, with three out of six geographical groups showing no overlap between gasoline sample classifications. Two groups showed minimal overlap, while one group had overlap between multiple gasoline sample classifications. Three groups had a LOOCV accuracy of 100% with no misclassifications; the remaining accuracies were 98%, 96.67%, and 85%. The PCA and comparison of DART-MS data provide evidence of successful differentiation between gasoline samples of the same brand across Massachusetts, New Hampshire, and Connecticut. This research aims to provide an overview and understanding of chemometrics and DART-MS and how these techniques may be applied for forensic source attribution purposes.
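A minimal sketch of the PCA-plus-LOOCV workflow is shown below; the spectra and labels are placeholders, and the k-NN classifier on PCA scores is an assumption rather than the thesis's exact model:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# `X` stands in for DART-MS spectra (rows: replicate measurements, columns:
# binned m/z intensities) and `y` for gasoline source labels.
rng = np.random.default_rng(3)
X = rng.random((70, 300))          # placeholder: 7 sources x 10 replicates
y = np.repeat(np.arange(7), 10)    # placeholder source labels

# Reduce dimensionality with PCA, then validate classification by LOOCV.
model = make_pipeline(StandardScaler(), PCA(n_components=5),
                      KNeighborsClassifier(n_neighbors=3))
acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2%}")
```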
207

A machine learning approach for ethnic classification: the British Pakistani face

Khalid Jilani, Shelina, Ugail, Hassan, Bukar, Ali M., Logan, Andrew J., Munshi, Tasnim January 2017 (has links)
Ethnicity is one of the most salient clues to face identity. Analysis of ethnicity-specific facial data is a challenging problem and is predominantly carried out using computer-based algorithms. Current published literature focusses on the use of frontal face images. We addressed the challenge of binary (British Pakistani or other ethnicity) ethnicity classification using profile facial images. The proposed framework is based on the extraction of geometric features using 10 anthropometric facial landmarks, within a purpose-built, novel database of 135 multi-ethnic and multi-racial subjects and a total of 675 face images. Image dimensionality was reduced using Principal Component Analysis and Partial Least Squares Regression. Classification was performed using a linear Support Vector Machine (SVM). The results of this framework are promising, with 71.11% ethnic classification accuracy using PCA + SVM and 76.03% using PLS + SVM.
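The framework lends itself to a short sketch; the feature matrix below is a placeholder for the landmark-derived geometric features, and the exact preprocessing is an assumption:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder geometric features computed from 10 facial landmarks
# (e.g., inter-landmark distances and angles) for 675 profile images.
rng = np.random.default_rng(4)
X = rng.random((675, 20))
y = rng.integers(0, 2, size=675)   # 1 = British Pakistani, 0 = other

# Reduce dimensionality with PCA, then classify with a linear SVM.
model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      LinearSVC(random_state=0))
print(cross_val_score(model, X, y, cv=5).mean())
```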
208

Unsupervised Learning for Efficient Underwriting

Dalla Torre, Elena January 2024 (has links)
In the field of actuarial science, statistical methods have been extensively studied to estimate the risk of insurance. These methods are good at estimating the risk of typical insurance policies, as historical data is available. However, their performance can be poor on unique insurance policies, which require the manual assessment of an underwriter. A classification of insurance policies on a unique/typical scale would help insurance companies allocate manual resources more efficiently and validate the goodness of fit of the pricing models on unique objects. The aim of this thesis is to use outlier detection methods to identify unique non-life insurance policies. The many categorical nominal variables present in insurance policy data sets represent a challenge when applying outlier detection methods. Therefore, we also explore different ways to derive informative numerical representations of categorical nominal variables. First, as a baseline, we use the principal component analysis of mixed data to find a numerical representation of categorical nominal variables and principal component analysis to identify unique insurances. Then, we see whether better performance can be achieved using autoencoders, which can capture complex non-linearities. In particular, we learn a numerical representation of categorical nominal variables using the encoder layer of an autoencoder, and we use a different autoencoder to identify unique insurances. Since we are in an unsupervised setting, the two methods are compared by performing a simulation study and using the NSL-KDD data set. The analysis shows autoencoders are superior to principal component analysis at identifying unique objects. We conclude that the ability of autoencoders to model complex non-linearities between the variables allows this class of methods to achieve superior performance.
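Below is a minimal sketch of the autoencoder-based detector described above (PyTorch; the architecture, training schedule, and data are illustrative assumptions, not the thesis's exact setup):

```python
import torch
import torch.nn as nn

# Train an autoencoder on the data, then flag the policies the network
# reconstructs worst as "unique" (outlier candidates).
torch.manual_seed(0)
X = torch.rand(1000, 12)  # placeholder numeric policy features

model = nn.Sequential(
    nn.Linear(12, 6), nn.ReLU(), nn.Linear(6, 3),              # encoder
    nn.ReLU(), nn.Linear(3, 6), nn.ReLU(), nn.Linear(6, 12),   # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

# Reconstruction error per policy; the highest scores are outlier candidates.
with torch.no_grad():
    errors = ((model(X) - X) ** 2).mean(dim=1)
print(torch.topk(errors, k=5).indices)
```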
209

Emprego de técnicas de análise exploratória de dados utilizados em Química Medicinal / Use of different techniques for exploratory data analysis in Medicinal Chemistry

Gertrudes, Jadson Castro 10 September 2013 (has links)
Research in Medicinal Chemistry has focused on methods that accelerate the process of drug discovery. Among the several steps involved in the discovery of bioactive substances is the analysis of the relationships between the chemical structure and the biological activity of compounds. In this process, medicinal chemistry researchers analyze data sets characterized by high dimensionality and a small number of observations. Within this context, this work presents a computational approach that aims to contribute to the analysis of chemical data and, consequently, to the discovery of new drugs for the treatment of chronic diseases. The exploratory data analysis approaches used in this work combine dimensionality reduction and clustering techniques to detect natural structures that reflect the biological activity of the analyzed compounds. Among the existing dimensionality reduction techniques, the Fisher score, principal component analysis, and sparse principal component analysis are discussed. As for the clustering algorithms, k-means, fuzzy c-means, and the enhanced ICA mixture model are evaluated. Four data sets containing information on bioactive substances were used: two related to the treatment of diabetes mellitus and metabolic syndrome, a third related to cardiovascular diseases, and a fourth containing substances that can be used in cancer treatment. In the experiments, the results suggest the use of dimensionality reduction techniques together with unsupervised algorithms for the task of clustering chemical data, since in these experiments it was possible to describe levels of biological activity of the studied compounds. Therefore, dimensionality reduction and clustering techniques can potentially serve as guides in the process of discovering and developing new compounds in the field of Medicinal Chemistry.
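The reduce-then-cluster workflow can be sketched briefly; the descriptor matrix below is a placeholder, and the choice of sparse PCA followed by k-means is one of the technique combinations named above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import SparsePCA
from sklearn.cluster import KMeans

# `X` is a placeholder for a high-dimensional molecular descriptor matrix
# of bioactive compounds (few observations, many variables).
rng = np.random.default_rng(5)
X = StandardScaler().fit_transform(rng.random((60, 150)))

# Reduce dimensionality with sparse PCA, then cluster in the reduced space.
Z = SparsePCA(n_components=3, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

# Clusters would then be inspected for agreement with known biological
# activity levels of the compounds.
print(np.bincount(labels))
```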
210

O uso de recursos linguísticos para mensurar a semelhança semântica entre frases curtas através de uma abordagem híbrida / Using linguistic resources to measure semantic similarity between short sentences through a hybrid approach

Silva, Allan de Barcelos 14 December 2017 (has links)
In Natural Language Processing (NLP), assessing semantic textual similarity (STS) is an important building block for applications on several fronts, such as information retrieval, text classification, document clustering, translation, and dialogue systems. The literature describes applications and techniques aimed largely at the English language, and probabilistic resources are used as a priority while linguistic aspects are explored only incipiently. Work in the area highlights that linguistics plays a fundamental role in assessing semantic textual similarity, precisely because it extends the potential of exclusively probabilistic methods and avoids some of their failures (e.g., distinguishing closely from distantly related sentences, or handling anaphora), which largely result from the lack of deeper treatment of aspects of the language. This is especially true for short sentences, the most common target of STS techniques, because such sentences carry a reduced amount of information, limiting the effectiveness of purely probabilistic treatment. It is therefore vital to identify and apply resources from a deeper study of the language to better understand the aspects that define similarity between sentences. This work presents an approach for assessing semantic textual similarity between short sentences in Brazilian Portuguese. Its main contribution is the use of a hybrid approach in which distributed representations as well as lexical and linguistic aspects are used. To consolidate the study, a methodology was defined that allows the analysis of several combinations of resources, making it possible to evaluate the gains introduced by adding linguistic aspects and by combining them with the knowledge generated by other techniques. The proposed approach was evaluated on data sets well known in the literature (the PROPOR 2016 event) and obtained good results.
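As a loose illustration of the hybrid idea, the sketch below combines a lexical signal with a vector-space signal; TF-IDF (in place of the thesis's distributed representations and richer linguistic features) and the fixed 50/50 weighting are simplifying assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Lexical signal: Jaccard overlap of the word sets of the two sentences.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Hybrid score: average of the lexical signal and a TF-IDF cosine signal.
def hybrid_similarity(s1, s2, corpus):
    tfidf = TfidfVectorizer().fit(corpus)
    cos = cosine_similarity(tfidf.transform([s1]), tfidf.transform([s2]))[0, 0]
    return 0.5 * jaccard(s1, s2) + 0.5 * cos

corpus = ["o gato subiu no telhado", "o gato está no telhado", "hoje vai chover"]
print(hybrid_similarity(corpus[0], corpus[1], corpus))
```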
