41

Spectral methods and computational trade-offs in high-dimensional statistical inference

Wang, Tengyao January 2016 (has links)
Spectral methods have become increasingly popular in designing fast algorithms for modern high-dimensional datasets. This thesis looks at several problems in which spectral methods play a central role. In some cases, we also show that such procedures have essentially the best performance among all randomised polynomial time algorithms, by exhibiting statistical and computational trade-offs in those problems. In the first chapter, we prove a useful variant of the well-known Davis–Kahan theorem, a spectral perturbation result that allows us to bound the distance between population eigenspaces and their sample versions. We then propose a semi-definite programming algorithm for the sparse principal component analysis (PCA) problem, and analyse its theoretical performance using the perturbation bounds derived earlier. It turns out that the parameter regime in which our estimator is consistent is strictly smaller than the consistency regime of a minimax optimal (yet computationally intractable) estimator. We show, through reduction from a well-known hard problem in computational complexity theory, that the difference in consistency regimes is unavoidable for any randomised polynomial time estimator, revealing subtle statistical and computational trade-offs in this problem. Such computational trade-offs also exist in the problem of restricted isometry certification. Certifiers for restricted isometry properties can be used to construct design matrices for sparse linear regression problems. As in the sparse PCA problem, we show that there is an intrinsic gap between the class of matrices certifiable using unrestricted algorithms and that certifiable using polynomial time algorithms. Finally, we consider the problem of high-dimensional changepoint estimation, where we estimate the time of change in the mean of a high-dimensional time series with piecewise constant mean structure. Motivated by real-world applications, we assume that changes occur only in a sparse subset of all coordinates. We apply a variant of the semi-definite programming algorithm from sparse PCA to aggregate the signals across different coordinates in a near-optimal way, so as to estimate the changepoint location as accurately as possible. Our statistical procedure shows superior performance compared to existing methods for this problem.
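As a rough illustration of the quantity such perturbation results control, the sketch below (not from the thesis; the spiked covariance model and all parameters are assumptions) computes the sin-theta distance between a population eigenspace and its sample version:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 50, 200, 2

# Spiked covariance: a k-dimensional signal subspace on top of identity noise.
U = np.linalg.qr(rng.standard_normal((p, k)))[0]
Sigma = U @ np.diag([5.0, 3.0]) @ U.T + np.eye(p)

# Sample covariance from n observations.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False)

def top_eigvecs(M, k):
    vals, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return vecs[:, np.argsort(vals)[::-1][:k]]

V_pop, V_hat = top_eigvecs(Sigma, k), top_eigvecs(S, k)

# Singular values of V_pop^T V_hat are the cosines of the principal angles,
# so the sin-theta distances follow directly.
cosines = np.linalg.svd(V_pop.T @ V_hat, compute_uv=False)
print("sin(principal angles):", np.sqrt(np.clip(1 - cosines**2, 0, None)))
```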
42

Development of Fourier transform infrared spectroscopy for drug response analysis

Hughes, Caryn Sian January 2011 (has links)
The feasibility of FTIR-based spectroscopy as a tool to measure cellular response to therapeutics was investigated. Fourier transform mid-infrared spectroscopy has been used in conjunction with multivariate analysis (MVA) to assess the chemistry of many clinically relevant biological materials; however, the technique has not yet found its place in a clinical setting. One issue that has held the technique back is the spectral distortion caused by resonant Mie scattering (RMieS), which affects the ability to confidently assign molecular origins to the spectral signals from biomaterials. In light of a recently improved understanding of RMieS, which resulted in a novel correction algorithm, the analytical robustness of corrected FTIR spectra was validated against multi-disciplinary methods to characterise a set of renal cell lines selected for their differences in morphology. After validation of the FTIR methodology by discriminating different cell lines, the second stage of analysis tested the sensitivity of the FTIR technique by determining whether discrete chemical differences could be highlighted within a cell population of the same origin. The renal carcinoma cell line 2245R is reported to contain a sub-population of cells displaying 'stem-cell like' properties; these stem-like cells, however, are difficult to isolate and characterise by conventional '-omic' means. Finally, cellular response to chemotherapeutics was investigated using the established renal cell lines CAKI-2 and A-498. For the model, 5-fluorouracil (5FU), an established chemotherapeutic agent with known mechanisms of action, was used. Novel gold-based therapeutic compounds were also assessed in parallel to determine their efficacy against renal cell carcinoma. The novel compounds displayed initial activity, as the FTIR evidence suggested the compounds were able to enter the cells in the first instance, evoking a cellular response. Their long-term performance, tracked with standard proliferation assays and FTIR spectroscopy in the renal cancer cell model, was however poor. Rather than dismissing the compounds as inactive, it may simply be that they are more effective against cancer cell types of a different nature; the FTIR-based evidence provided the means to suggest such a conclusion. Overall, the initial results suggest that the combination of FTIR and MVA, together with the novel RMieS-EMSC correction algorithm, can detect differences in cellular response to chemotherapeutics. The results were also in line with complementary biological techniques, demonstrating the powerful potential of the technique as a promising drug-screening tool.
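For context, basic extended multiplicative signal correction (EMSC) regresses each spectrum on a reference spectrum plus a low-order polynomial baseline; the RMieS-EMSC algorithm referenced above goes further by modelling resonant Mie scattering explicitly. A minimal sketch of the basic variant, with placeholder spectra, might look like:

```python
import numpy as np

def emsc_correct(spectra, reference, poly_order=2):
    """Regress each spectrum on [reference, 1, x, x^2, ...] and keep
    only the part explained by the reference (the 'chemical' signal).
    spectra: (n_samples, n_wavenumbers); reference: (n_wavenumbers,)."""
    x = np.linspace(-1.0, 1.0, reference.size)
    basis = np.column_stack([reference] + [x**d for d in range(poly_order + 1)])
    coefs, *_ = np.linalg.lstsq(basis, spectra.T, rcond=None)
    b = coefs[0]                          # multiplicative scatter coefficients
    baseline = basis[:, 1:] @ coefs[1:]   # fitted polynomial backgrounds
    return (spectra - baseline.T) / b[:, None]

# Hypothetical usage: correct placeholder spectra against their mean.
raw = np.random.default_rng(1).random((10, 400)) + 0.5
corrected = emsc_correct(raw, raw.mean(axis=0))
```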
43

Classify part of day and snow on the load of timber stacks : A comparative study between partitional clustering and competitive learning

Nordqvist, My January 2021 (has links)
In today's society, companies are trying to find ways to utilize all the data they have, data which contains valuable information and insights for making better decisions. This includes data used to keep track of the timber that flows between forest and industry. The growth of Artificial Intelligence (AI) and Machine Learning (ML) has enabled the development of ML models to automate the measurement of timber on timber trucks, based on images. However, to improve the results there is a need to extract information from unlabeled images in order to determine weather and lighting conditions. The objective of this study is to perform an extensive comparison of methods for classifying unlabeled images into the categories daylight, darkness, and snow on the load. A comparative study between partitional clustering and competitive learning is conducted to investigate which method gives the best results in terms of different clustering performance metrics; it also examines how dimensionality reduction affects the outcome. The algorithms K-means and Kohonen Self-Organizing Map (SOM) are selected for the clustering. Each model is investigated with respect to the number of clusters, size of dataset, clustering time, clustering performance, and manual inspection of samples from each cluster. The results indicate a noticeable clustering performance discrepancy between the algorithms concerning the number of clusters, dataset size, and manual samples. The use of dimensionality reduction led to shorter clustering time but slightly worse clustering performance. The evaluation results further show that the clustering time of Kohonen SOM is significantly higher than that of K-means.
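As a rough sketch of such a comparison (not the thesis code; the feature extraction, cluster count, and the third-party minisom package are all assumptions), K-means and a Kohonen SOM can be scored side by side:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from minisom import MiniSom  # third-party Kohonen SOM implementation

features = np.random.default_rng(2).random((500, 128))  # placeholder image features
X = PCA(n_components=16).fit_transform(features)        # optional reduction step

# Partitional clustering: K-means with one cluster per category.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Competitive learning: a 1x3 SOM, using each sample's best-matching unit.
som = MiniSom(1, 3, X.shape[1], sigma=0.5, learning_rate=0.5, random_seed=0)
som.train(X, 1000)
som_labels = np.array([som.winner(x)[1] for x in X])

print("K-means silhouette:", silhouette_score(X, km_labels))
print("SOM silhouette:", silhouette_score(X, som_labels))
```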
44

Analyzing Recycling Habits in Mahoning County, Ohio

Yengwia, Lawrenzo N. January 2017 (has links)
No description available.
45

Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study

Bonner, Ashley J. 10 1900 (has links)
Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) with only a limited number of study participants (n). Determining the important features proves statistically difficult, as multivariate analysis techniques become overwhelmed and mathematically insufficient when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization but suffers from these issues. A collection of Sparse PCA methods have been proposed to counter these flaws but have not been tested in comparative detail. Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups, and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. The Sparse PCA methods were also applied to a real gene expression dataset. Results: All Sparse PCA methods showed improvements upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. The optimal choice of Sparse PCA method differs with within-group correlation and across-group variances; thankfully, one method repeatedly worked well under the most difficult scenarios. When applying the methods to real data, concise groups of gene expressions were detected with the sparsest methods. Conclusions: Sparse PCA methods provide a new, insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc)
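A brief sketch of this kind of comparison on synthetic n < p data, using scikit-learn's SparsePCA as one representative sparse method (the three methods the thesis actually compares are not named here; the data structure is invented):

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(3)
n, p = 100, 500
# Only the first 10 of p variables carry a shared signal (n < p).
X = rng.standard_normal((n, p))
X[:, :10] += 3.0 * rng.standard_normal((n, 1))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=2.0, random_state=0).fit(X)

# Sparse loadings should concentrate on the 10 informative variables.
print("nonzero loadings, PCA PC1:", np.count_nonzero(pca.components_[0]))
print("nonzero loadings, SparsePCA PC1:", np.count_nonzero(spca.components_[0]))
```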
46

Ionic Characterization of Laundry Detergents: Implications for Consumer Choice and Inland Freshwater Salinization

Mendoza, Kent Gregory 11 April 2024 (has links)
Increased salinity in freshwater systems – also called the Freshwater Salinization Syndrome (FSS) – can have far-ranging implications for the natural and built environment, agriculture, and public health at large. Such risks are clearly on display in the Occoquan Reservoir – a drinking water source for roughly one million people in the northern Virginia/National Capital Region. Sodium concentrations in the Occoquan Reservoir are approaching levels that can affect taste and health. The Reservoir is also noteworthy as a flagship example of indirect potable reuse, which further adds complexity to understanding the sources of rising levels of sodium and other types of salinity. To help understand the role residential discharges might play in salinization of the Occoquan Reservoir, a suite of laundry detergent products was identified based upon survey data collected in the northern Virginia region. The ionic compositions of these products were then characterized using ion chromatography and inductively coupled plasma-mass spectrometry to quantify select ionic and elemental analytes. Sodium, chloride, and sulfate were consistently found in appreciable amounts. To comparatively characterize the laundry detergents, principal component analysis was employed to identify clusters of similar products. The physical formulation of the products was identified as a marker for their content, with dry formulations (free-flowing and encapsulated powders) being more enriched in sodium and sulfate. This result was corroborated by comparing nonparametric bootstrap intervals for individual analytes. The study's findings suggest an opportunity wherein consumer choice can play a role in mediating residential salt inputs to receiving bodies such as the Occoquan Reservoir. / Master of Science / Many streams, rivers, and other freshwater systems have become increasingly salty in recent decades. A rise in salinity can be problematic, stressing aquatic life, corroding pipes, and even enhancing the release of more pollutants into the water. This phenomenon, called Freshwater Salinization Syndrome, can threaten such systems' ability to serve as sources of drinking water, as is the case for the Occoquan Reservoir in northern Virginia. Serving roughly one million people, the Reservoir is notable for being one of the first in the country to purposely incorporate highly treated wastewater upstream of a drinking water supply. Despite the Reservoir's prominence, the reasons behind its rising salt levels are not well understood. This study sought to understand the role that individual residences could play when household products travel down the drain and are ultimately discharged into the watershed. Laundry detergents are potentially high-salt products. A survey of northern Virginians' laundry habits was conducted to understand local tastes and preferences. Informed by the survey, a suite of laundry detergents was chemically characterized to measure salt and element concentrations. The detergents were found to have notable amounts of sodium, chloride, and sulfate in particular, with sodium being the most abundant analyte in every detergent. However, not all detergents were equally salty; statistical tools revealed that dry formulations (such as powdered and powder-filled pak detergents) contributed more sodium and sulfate, among other analytes. This study's findings suggest that laundry detergents could be contributing to Freshwater Salinization Syndrome in the Occoquan Reservoir, and that local consumers' choice of detergents could make a difference.
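A minimal sketch of the nonparametric (percentile) bootstrap intervals mentioned above, with made-up sodium measurements for a single hypothetical detergent:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical sodium content (mg/g) for one detergent, from replicate analyses.
sodium = np.array([112.0, 118.5, 109.3, 121.0, 115.2, 117.8])

# Percentile bootstrap: resample with replacement, collect the mean each time.
boot_means = [rng.choice(sodium, size=sodium.size, replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap interval for mean sodium: ({lo:.1f}, {hi:.1f}) mg/g")
```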
47

Outlier detection with ensembled LSTM auto-encoders on PCA transformed financial data / Avvikelse-detektering med ensemble LSTM auto-encoders på PCA-transformerad finansiell data

Stark, Love January 2021 (has links)
Financial institutions today generate large amounts of data, data that can contain information worth investigating to further the economic growth of the institution. There is particular interest in analyzing data points that are anomalous relative to the normal day-to-day work. Finding these outliers is not an easy task, however, and is not possible to do manually given the massive amounts of data generated daily. Previous work on this problem has explored the use of machine learning to find outliers in financial datasets, and previous studies have shown that pre-processing usually accounts for a large part of the information loss. This work studies whether pre-processing can be balanced so as to retain as much information as possible while not leaving the data too complex for the machine learning models. The dataset consisted of foreign exchange transactions supplied by the host company and was pre-processed using Principal Component Analysis (PCA). The main purpose of this work is to test whether an ensemble of Long Short-Term Memory recurrent neural networks (LSTM), configured as autoencoders, can be used to detect outliers in the data, and whether the ensemble is more accurate than a single LSTM autoencoder. Previous studies have shown that ensembled autoencoders can be more accurate than a single autoencoder, especially when SkipCells are implemented (a configuration that skips over LSTM cells to make the models perform with more variation). A datapoint is considered an outlier if the LSTM model has trouble properly recreating it, i.e. a pattern that is hard to reconstruct, making it available for further manual investigation. The results show that the ensembled LSTM model was more accurate than a single LSTM model at reconstructing the dataset and, by our definition of an outlier, more accurate at outlier detection. The pre-processing experiments reveal different methods of obtaining an optimal number of components for the data, one of which is to study the retained variance and accuracy of the PCA transformation against model performance for a given number of components. One conclusion from the work is that ensembled LSTM networks can prove very powerful, but that alternatives to the pre-processing should be explored, such as categorical embedding instead of PCA.
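A minimal single LSTM autoencoder of the kind described, sketched in Keras (the thesis ensembles several such models and adds skip configurations; the shapes, epochs, and 99th-percentile threshold are illustrative assumptions). An ensemble would combine, e.g. average, the reconstruction errors of several independently trained copies:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 8
X = np.random.default_rng(5).random((1000, timesteps, n_features)).astype("float32")

# Sequence-to-sequence autoencoder: encode, repeat the latent vector, decode.
model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(16),                         # encoder -> latent vector
    layers.RepeatVector(timesteps),          # expand latent back into a sequence
    layers.LSTM(16, return_sequences=True),  # decoder
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, X, epochs=5, batch_size=64, verbose=0)

# Flag as outliers the sequences the model reconstructs worst.
errors = np.mean((model.predict(X, verbose=0) - X) ** 2, axis=(1, 2))
print("flagged indices:", np.where(errors > np.percentile(errors, 99))[0])
```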
48

Contribution à la modélisation de la qualité de l'orge et du malt pour la maîtrise du procédé de maltage / Modeling contribution of barley and malt quality for the malting process control

Ajib, Budour 18 December 2013 (has links)
In a continuously growing market, and in order to meet brewers' needs for high-quality malt, control of the malting process is a great challenge. Malt quality is highly dependent on the operating conditions of the malting process, especially the steeping conditions, but also on the quality of the raw material: barley. In this study, we established polynomial models that relate the operating conditions to malt quality. These models were coupled with our genetic algorithms to determine optimal steeping conditions, either to reach a targeted malt quality (friability), or to allow malting at low water content (to reduce water consumption and control the environmental costs of production) while maintaining acceptable malt quality. However, the variability of the raw material is a limiting factor for our approach. The established models are very sensitive to the species (spring or winter barley) and to the barley variety, and above all highly dependent on the crop year. Variations in barley properties from one crop year to the next are poorly characterized and are not incorporated into our models; they thus prevent us from capitalizing on experimental information over time. Some structural properties of barley (porosity, hardness) were considered as new factors to better characterize the raw material, but they did not explain the variations observed in the malthouse. To characterize the raw material, 394 barley samples from three crop years (2009, 2010, 2011) were analysed by MIR spectroscopy. PCA analyses confirmed the significant effect of crop year, species, variety and sometimes place of harvest on the properties of barley. PLS regression made it possible, for some years and some species, to predict the protein and beta-glucan content of barley from MIR spectra. These results still face raw-material variability; nevertheless, the new PLS models are promising and could be exploited to implement control strategies for the malting process based on MIR spectroscopic measurements.
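A small sketch of the PLS step on placeholder MIR spectra (synthetic data; the number of components and the link between the protein target and one spectral region are assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
# Placeholder MIR spectra (absorbance at 600 wavenumbers) for 394 samples,
# with a synthetic protein-content target tied to one spectral region.
spectra = rng.random((394, 600))
protein = 8.0 + 4.0 * spectra[:, 100] + rng.normal(0.0, 0.2, 394)

pls = PLSRegression(n_components=10)
r2 = cross_val_score(pls, spectra, protein, cv=5)  # default scoring is R^2
print("cross-validated R^2:", r2.mean().round(3))
```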
49

Identificação rápida de contaminantes microbianos em produtos farmacêuticos / Rapid identification of microbial contaminants in pharmaceutical products

Brito, Natalia Monte Rubio de 12 June 2019 (has links)
The microbiological quality of pharmaceuticals is fundamental to ensuring the efficacy and safety of medicines. Conventional methods for microbial identification in non-sterile drugs are widely used, but they are time-consuming and laborious. The aim of this work was to develop a rapid microbiological method (RMM) for identification of contaminants in pharmaceutical products using Fourier transform infrared spectroscopy with attenuated total reflectance (FTIR-ATR). Principal component analysis (PCA) and linear discriminant analysis (LDA) were used to obtain a predictive model capable of distinguishing growth of Bacillus subtilis (ATCC 6633), Candida albicans (ATCC 10231), Enterococcus faecium (ATCC 8459), Escherichia coli (ATCC 8739), Micrococcus luteus (ATCC 10240), Pseudomonas aeruginosa (ATCC 9027), Salmonella Typhimurium (ATCC 14028), Staphylococcus aureus (ATCC 6538), and Staphylococcus epidermidis (ATCC 12228). The FTIR-ATR spectra provide information on the protein, DNA/RNA, lipid, and carbohydrate composition of microbial growth. The microbial identifications provided by the FTIR-ATR-based PCA/LDA model were compatible with those obtained using conventional microbiological methods. The FTIR-ATR method for rapid identification of microbial contaminants was validated by assessing its sensitivity (93.5%), specificity (83.3%), and limit of detection (17-23 CFU/mL of sample). Therefore, the RMM proposed in this work may be used to provide rapid identification of microbial contaminants in pharmaceutical products.
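A compact sketch of the PCA/LDA modelling step on placeholder spectra (the species-specific offset, component count, and replicate structure are invented for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
# Placeholder FTIR-ATR spectra: 9 species x 20 replicates, 900 wavenumbers,
# with a small species-specific offset so the classes are separable.
n_species, reps, n_wn = 9, 20, 900
y = np.repeat(np.arange(n_species), reps)
X = rng.random((n_species * reps, n_wn)) + 0.3 * rng.random((n_species, n_wn))[y]

# PCA compresses the spectra; LDA separates the species in that subspace.
clf = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))
```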
50

Développement de méthodes d'analyse de données en ligne / Development of methods to analyze data steams

Bar, Romain 29 November 2013 (has links)
High-dimensional data are supposed to be independent on-line observations of a random vector. In the second chapter, this vector, denoted Z, is split into two random vectors R and S, and the observations are supposed to be identically distributed. A recursive method for sequential estimation of the first r factors of the projected PCA of R with respect to S is defined. Several particular cases are then investigated: canonical correlation analysis, canonical discriminant analysis, and canonical correspondence analysis; in each case, several estimation processes specific to the analysis at hand are proposed. In the third chapter, the data are observations of a random vector Z_n whose expectation E_n varies with time. Let R_n = Z_n - E_n, and suppose that the vectors R_n form an independent and identically distributed sample of a random vector R. Stochastic approximation processes are used to estimate, on-line, direction vectors of the principal axes of a partial principal components analysis (PCA) of R. This is then applied to the particular case of a partial generalized canonical correlation analysis (gCCA), after defining a stochastic approximation process of the Robbins-Monro type to estimate recursively the inverse of a covariance matrix. The fourth chapter considers the case where both the expectation and the covariance matrix of Z_n vary with time n. Finally, simulation results are given in chapter 5.
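A minimal sketch in the spirit of these stochastic approximation processes: Oja's rule estimates the leading principal axis one observation at a time, with Robbins-Monro style step sizes (the covariance model and step schedule are assumptions, not the thesis's processes):

```python
import numpy as np

rng = np.random.default_rng(8)
p = 10
# Diagonal covariance with a clear leading eigenvalue (the eigengap matters).
Sigma = np.diag([10.0, 5.0] + [1.0] * (p - 2))
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=20_000)

w = rng.standard_normal(p)
w /= np.linalg.norm(w)

# Oja's rule: one Hebbian update per incoming observation, then renormalise,
# with Robbins-Monro step sizes gamma_n = 1/n.
for n, z in enumerate(Z, start=1):
    w += (1.0 / n) * z * (z @ w)
    w /= np.linalg.norm(w)

true_axis = np.linalg.eigh(Sigma)[1][:, -1]  # eigenvector of largest eigenvalue
print("alignment with true first axis:", abs(w @ true_axis).round(4))
```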
