91 |
Classification models for high-dimensional data with sparsity patterns. Tillander, Annika (January 2013)
Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample setting are considered. There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive, and their computational burden is considerably lower for discrete data. Hence, continuous features are often transformed into discrete features. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches. Further, the effect of discretization on misclassification probability in the high-dimensional setting is evaluated. Linear classifiers are more stable, which motivates adapting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage estimation procedure for the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation gives a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework that represents sparse and weak block models is derived and a technique for block-wise feature selection is proposed. Probabilistic classifiers have the advantage of providing the probability of membership in each class for new observations, rather than simply assigning them to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller tractable problems. The relevance and benefits of the proposed methods are illustrated using both simulated and real data.

With today's technology, for example spectrometers and gene chips, data are generated in vast quantities. This abundance of data is not only an advantage but also causes certain problems: typically, the number of variables (p) is considerably larger than the number of observations (n). This gives so-called high-dimensional data, which require new statistical methods, since the traditional methods were developed for the opposite situation (p < n). Moreover, usually very few of all these variables are relevant for any given project, and the strength of the information in the relevant variables is often weak; this type of data is therefore commonly described as sparse and weak, and identifying the relevant variables is often likened to finding a needle in a haystack. This thesis treats three different ways of classifying this kind of high-dimensional data, where classifying means using a data set with both explanatory variables and an outcome variable to teach a function or algorithm to predict the outcome variable from the explanatory variables alone. The real data used in the thesis are microarrays, cell samples that show the activity of the genes in the cell. The goal of the classification is to use the variation in activity across the thousands of genes (the explanatory variables) to determine whether a cell sample comes from cancer tissue or normal tissue (the outcome variable). There are classification methods that can handle high-dimensional data, but these are often computationally intensive and therefore often work better for discrete data. By transforming continuous variables into discrete ones (discretization), the computation time can be reduced and the classification made more efficient. The thesis studies how discretization affects the prediction accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed. Linear classification methods have the advantage of being stable; their drawback is that they require an invertible covariance matrix, which the covariance matrix of high-dimensional data is not. The thesis proposes a way to estimate the inverse of sparse covariance matrices by a block-diagonal matrix. This matrix has the further advantage of leading to additive classification, which makes it possible to select whole blocks of relevant variables, and a method for identifying and selecting the blocks is also presented. There are, in addition, probabilistic classification methods, which have the advantage of giving the probability of belonging to each of the possible outcomes for an observation, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes. The relevance and benefits of the proposed methods are demonstrated by applying them to simulated and real high-dimensional data.
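The second paper's pipeline lends itself to a compact illustration. The following is a minimal, hypothetical Python reconstruction (not the thesis code): it estimates a sparse precision matrix with Lasso-type regularization, reorders variables with reverse Cuthill-McKee to pull the nonzeros toward the diagonal, cuts the result into diagonal blocks, and evaluates an additive discriminant score. The regularization strength, the block-detection threshold, and the simulated data are all illustrative assumptions.

```python
# Hypothetical sketch: sparse precision estimation, Cuthill-McKee ordering,
# block-diagonal approximation, and an additive linear discriminant score.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 40, 60                              # illustrative sample size / dimension
X0 = rng.normal(size=(n, p))               # class 0 training sample
X1 = rng.normal(loc=0.3, size=(n, p))      # class 1 training sample

# 1) Lasso-regularized inverse covariance, pooled over both classes.
gl = GraphicalLasso(alpha=0.4).fit(np.vstack([X0, X1]))
theta = gl.precision_

# 2) Reverse Cuthill-McKee ordering of the sparsity graph concentrates the
#    nonzero entries near the diagonal, revealing a near-block structure.
support = csr_matrix((np.abs(theta) > 1e-6).astype(int))
order = reverse_cuthill_mckee(support, symmetric_mode=True)
theta_o = theta[np.ix_(order, order)]

# 3) Cut into diagonal blocks wherever no off-diagonal entry crosses.
blocks, start, reach = [], 0, 0
for j in range(p):
    nz = np.nonzero(np.abs(theta_o[j]) > 1e-6)[0]
    reach = max(reach, nz.max())
    if reach == j:                         # no edge reaches past position j
        blocks.append(slice(start, j + 1))
        start = j + 1

# 4) Additive linear discriminant: total score is a sum of per-block scores.
mu0, mu1 = X0.mean(0)[order], X1.mean(0)[order]
def score(x):
    x = x[order]
    s = 0.0
    for b in blocks:
        w = theta_o[b, b] @ (mu1[b] - mu0[b])
        s += w @ (x[b] - 0.5 * (mu0[b] + mu1[b]))
    return s                               # classify as class 1 if s > 0
```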
|
92 |
Machine Learning Techniques for Large-Scale System Modeling. Lv, Jiaqing (31 August 2011)
This thesis addresses several issues in system modeling. The first is a parsimonious representation of the MISO Hammerstein system, obtained by projecting the multivariate linear function onto a univariate input function space. This leads to the so-called semiparametric Hammerstein model, which overcomes the well-known "curse of dimensionality" in nonparametric estimation of MISO systems. The second issue is orthogonal expansion analysis of a univariate Hammerstein model and hypothesis testing for the structure of the nonlinear subsystem. A generalization of this technique can be used to test the validity of parametric assumptions about the nonlinear function in Hammerstein models; it can also be applied to approximate a general nonlinear function by a certain class of parametric functions in Hammerstein models. These techniques extend, with slight modification, to other block-oriented systems, e.g., Wiener systems. The third issue is the application of machine learning and system modeling techniques to transient stability studies in power engineering. Simultaneous variable selection and estimation leads to substantially reduced complexity, yet possesses stronger prediction power than techniques known in the power engineering literature so far.
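As a hedged illustration of the block-oriented setting, the sketch below simulates a SISO Hammerstein system (a static nonlinearity followed by a linear FIR block) and recovers it by least squares on a polynomial expansion of lagged inputs. This is the classical parametric approach, not the semiparametric estimator developed in the thesis; the orders and the true system are assumptions.

```python
# Minimal Hammerstein identification sketch: static nonlinearity f(u)
# followed by an FIR filter; joint least squares on a polynomial basis.
import numpy as np

rng = np.random.default_rng(1)
N = 2000
u = rng.uniform(-1, 1, N)                  # persistently exciting input
f = lambda u: u + 0.5 * u**2 - 0.3 * u**3  # true (unknown) nonlinearity
g = np.array([1.0, 0.6, -0.2])             # true FIR impulse response
x = f(u)
y = np.convolve(x, g)[:N] + 0.05 * rng.normal(size=N)

# Regression matrix: lagged powers u[t-k]**d for k = 0..m-1, d = 1..D.
m, D = 3, 3
cols = []
for k in range(m):
    uk = np.roll(u, k); uk[:k] = 0.0       # u delayed by k samples
    for d in range(1, D + 1):
        cols.append(uk**d)
Phi = np.column_stack(cols)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# theta[k*D + d - 1] estimates g[k] * c_d, the product of an FIR tap and a
# polynomial coefficient; these are identifiable up to a common scaling.
print(theta.reshape(m, D))
```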
|
93 |
Duomenų dimensijos mažinimas naudojant autoasociatyvinius neuroninius tinklus / Data dimensionality reduction using autoassociative neural networks. Bendinskienė, Janina (31 July 2012)
This master's thesis gives an overview of dimensionality reduction (visualization) techniques for multivariate data, among them artificial neural networks. The main concepts of artificial neural networks are presented (the biological neuron and the artificial neuron model, learning strategies, multilayer networks, and so on), and autoassociative neural networks are analyzed. The aim of this work is to examine the application of autoassociative neural networks to the dimensionality reduction and visualization of multidimensional data, and to explore how the results depend on different parameters. To achieve this, experiments were performed on several multidimensional data sets. The experiments identified the parameters that influence the performance of an autoassociative neural network. In addition, the results were compared using two different errors made by the network: the MDS error and the autoassociative error. The MDS error shows how well the distances between the analyzed points (vectors) are preserved in the transition from the multidimensional space to a lower-dimensional space. The output values of an autoassociative network should coincide with its input values, so the autoassociative error shows how well this is achieved (the difference between inputs and outputs is evaluated). The influence of the following parameters of the autoassociative neural network on these errors was investigated: the activation function, the minimized (cost) function, the training function, the number of epochs, the number of hidden neurons, and the choice of the reduced dimension.
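The two error measures can be illustrated compactly. In the sketch below, PCA stands in for a trained linear autoassociative network with a two-unit bottleneck (the equivalence of a linear autoencoder and PCA is the assumption that justifies this substitution); the dataset and target dimension are illustrative choices, not those of the thesis.

```python
# Sketch of the two error measures: autoassociative (reconstruction) error
# and an MDS stress-like error, with PCA as a linear autoencoder stand-in.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist

X = load_iris().data                       # 150 points in 4 dimensions
pca = PCA(n_components=2).fit(X)
Y = pca.transform(X)                       # bottleneck (2-D) representation
X_hat = pca.inverse_transform(Y)           # network outputs (reconstruction)

# Autoassociative error: mismatch between network inputs and outputs.
auto_err = np.mean((X - X_hat) ** 2)

# MDS (stress-like) error: how well pairwise distances are preserved when
# moving from the original space to the 2-D projection.
d_high, d_low = pdist(X), pdist(Y)
mds_err = np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))

print(f"autoassociative error: {auto_err:.4f}, MDS stress: {mds_err:.4f}")
```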
|
94 |
Reading the Sky : From Starspots to Spotting Stars. Eriksson, Urban (January 2014)
This thesis encompasses two research fields in astronomy: astrometry and astronomy education research (AER), discussed in two parts. These parts represent two sides of a coin: astrometry, which is about constructing 3D representations of the Universe, and AER, where the goal for this thesis is to investigate university students' and lecturers' disciplinary discernment vis-à-vis the structure of the Universe and the extrapolation of three-dimensionality. Part I presents an investigation of the influence of stellar surface structures on ultra-high-precision astrometry. The expected effects in different regions of the HR diagram were quantified. I also investigated the astrometric effect of exoplanets, since astrometric detection will become possible with projects such as Gaia. Stellar surface structures produce small brightness variations, influencing integrated properties such as the total flux, radial velocity and photocentre position. These properties were modelled, and statistical relations between the variations of the different properties were derived. From the models it is clear that for most stellar types the astrometric jitter due to stellar surface structures is expected to be of order 10 μAU or greater. This is more than the astrometric displacement typically caused by an Earth-sized exoplanet in the habitable zone, which is about 1–4 μAU, making astrometric detection difficult. Part II presents an investigation of disciplinary discernment at the university level. Astronomy education is a particularly challenging experience for students because discernment of the 'real' Universe is problematic, making interpretation of the many discipline-specific representations used an important educational issue. The ability to 'fluently' discern the disciplinary affordances of these representations becomes crucial for the effective learning of astronomy. To understand the Universe, I conclude, specific experiences are called for. Simulations can offer these experiences, where parallax motion is a crucial component. In a qualitative study, I analysed students' and lecturers' discernment while watching a simulation video, and found hierarchies that characterize discernment in terms of three-dimensionality extrapolation and an Anatomy of Disciplinary Discernment. I combined these to define a new construct, Reading the Sky, and conclude that it is a vital competency for learning astronomy, suggesting strategies for implementing it in astronomy education.
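The quoted 1–4 μAU scale for an Earth-sized planet can be checked with the standard astrometric-signature relation: the star's orbit about the system barycentre has semi-major axis a_star = (M_planet / M_star) × a_planet. The sketch below uses rounded solar-system constants, assumed here for illustration.

```python
# Back-of-envelope check of the exoplanet displacement scale quoted above.
M_earth_over_sun = 3.0e-6      # Earth/Sun mass ratio (rounded)
a_habitable_au = 1.0           # habitable-zone orbit for a Sun-like star

a_star_au = M_earth_over_sun * a_habitable_au
print(f"stellar displacement: {a_star_au * 1e6:.1f} micro-AU")
# -> ~3 micro-AU, inside the 1-4 micro-AU range quoted above, and well
#    below the ~10 micro-AU jitter expected from stellar surface structures.
```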
|
97 |
Isometry and convexity in dimensionality reduction. Vasiloglou, Nikolaos (30 March 2009)
The amount of data generated every year grows exponentially; both the number of data points and their dimensionality have increased dramatically over the past 15 years, and the gap between the data-processing demands of industry and the solutions provided by the machine learning community keeps widening. Despite the growth in memory and computational power, advanced statistical processing of data on the order of gigabytes remains out of reach: most sophisticated machine learning algorithms have at least quadratic complexity, and with current computer architectures, algorithms with complexity higher than linear O(N) or O(N log N) are not considered practical. Dimensionality reduction is a challenging problem in machine learning. Data represented as multidimensional points often have high nominal dimensionality, yet the information they carry can be expressed with far fewer dimensions; moreover, the reduced dimensions can be more interpretable than the original ones. There is a great variety of dimensionality reduction algorithms under the theory of manifold learning. Most methods, such as Isomap, Local Linear Embedding, Local Tangent Space Alignment and Diffusion Maps, have been extensively studied within the framework of Kernel Principal Component Analysis (KPCA). In this dissertation we study two state-of-the-art dimensionality reduction methods that do not fit under the umbrella of Kernel PCA: Maximum Variance Unfolding (MVU) and Non-Negative Matrix Factorization (NMF). MVU is cast as a semidefinite program, a modern convex nonlinear optimization framework that offers more flexibility and power than KPCA. Although MVU and NMF seem to be two disconnected problems, we show that there is a connection between them: both are special cases of a general nonlinear factorization algorithm that we developed. Two aspects of the algorithms are of particular interest: computational complexity and interpretability. Computational complexity answers the question of how fast we can find the best solution of MVU/NMF for large data volumes; since we are dealing with optimization programs, we need to find the global optimum, which is strongly connected with the convexity of the problem. Interpretability is strongly connected with local isometry, which gives meaning to relationships between data points; another aspect of interpretability is the association of data with labeled information. The contributions of this thesis are the following:
1. MVU is modified so that it scales more efficiently. Results are shown on a speech dataset of 1 million points, and limitations of the method are highlighted.
2. An algorithm for fast computation of furthest neighbors is presented for the first time in the literature.
3. Construction of optimal kernels for kernel density estimation with modern convex programming is presented. For the first time, we show that the leave-one-out cross-validation (LOOCV) function is quasi-concave.
4. For the first time, NMF is formulated as a convex optimization problem.
5. An algorithm for the problem of Completely Positive Matrix Factorization is presented.
6. A hybrid algorithm of MVU and NMF, the isoNMF, is presented, combining advantages of both methods.
7. Isometric Separation Maps (ISM), a variation of MVU that incorporates classification information, is presented.
8. Large-scale nonlinear dimensionality analysis is performed on the TIMIT speech database.
9. A general nonlinear factorization algorithm based on sequential convex programming is presented.

Despite the efforts to scale the proposed methods up to 1 million data points in reasonable time, the gap between industrial demand and the current state of the art is still orders of magnitude wide. For contrast with the convex formulation contributed in item 4, a standard non-convex NMF baseline is sketched below.
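The baseline is the usual multiplicative-update NMF as implemented in scikit-learn; it is not the thesis's convex algorithm, and the data, rank and solver options are illustrative assumptions.

```python
# Standard (non-convex) NMF baseline: X ~ W H with W, H >= 0.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 50)))    # nonnegative data matrix

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)                # 200 x 5 reduced representation
H = model.components_                     # 5 x 50 nonnegative basis
print("reconstruction error:", model.reconstruction_err_)
# Interpretability comes from nonnegativity: each row of X is explained as
# an additive, parts-based combination of the 5 basis rows in H.
```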
|
98 |
Reducing the dimensionality of hyperspectral remotely sensed data with applications for maximum likelihood image classification. Santich, Norman Ty (January 2007)
As well as the many benefits associated with the evolution of multispectral sensors into hyperspectral sensors, there is also a considerable increase in storage space and in the computational load required to process the data. Consequently, the remote sensing community is investigating and developing statistical methods to alleviate these problems.

The research presented here investigates several approaches to reducing the dimensionality of hyperspectral remotely sensed data while maintaining the levels of accuracy achieved using the full dimensionality of the data. It was conducted with an emphasis on applications in maximum likelihood classification (MLC) of hyperspectral image data. An inherent characteristic of hyperspectral data is that adjacent bands are typically highly correlated, which results in a high level of redundancy in the data. These high correlations between adjacent bands can be exploited to realise significant reductions in the dimensionality of the data, for a negligible reduction in classification accuracy.

The high correlation between neighbouring bands is related to the large overlap of their response functions. The spectral band filter functions were modelled for the HyMap instrument that acquired the hyperspectral data used in this study, and the results were compared with measured filter function data from a similar, more recent HyMap instrument. The results indicated that, on average, HyMap spectral band filter functions overlap their neighbouring bands by approximately 60%. This is considerable and partly accounts for the high correlation between neighbouring spectral bands on hyperspectral instruments.

A hyperspectral HyMap image acquired over an agricultural region in the south west of Western Australia was used for this research. The image is composed of 512 × 512 pixels, each with a spatial resolution of 3.5 m. The data were initially reduced from 128 to 82 spectral bands by removing the highly overlapping spectral bands, those exhibiting high levels of noise, and those located at strong atmospheric absorption wavelengths. The image was examined and found to contain 15 distinct spectral classes. Training data were selected for each of these classes, and class spectral mean and covariance matrices were generated.

The discriminant function for MLC makes use not only of the measured pixel spectra but also of the sample class covariance matrices. This thesis first examines reducing the parameterization of these covariance matrices for use by the MLC algorithm. The full-dimensional spectra are still used for the classification, but the number of parameters needed to describe the covariance information is significantly reduced. When a threshold of 0.04 was used in conjunction with the partial correlation matrices to identify low values in the inverse covariance matrices, the resulting classification accuracy was 96.42%. This was achieved using only 68% of the elements in the original covariance matrices.

Both wavelet techniques and cubic splines were investigated as a means of representing the measured pixel spectra with considerably fewer bands. Of the mother wavelets used, the Daubechies-4 wavelet performed slightly better than the Haar and Daubechies-6 wavelets at generating accurate spectra with the fewest parameters. The wavelet techniques investigated produced more accurately modelled spectra than cubic splines with various knot selection approaches. A backward stepwise knot selection technique was found to be more effective at approximating the spectra than regularly spaced knots; a forward stepwise selection technique was also investigated but was determined to be unsuited to this process.

All approaches were adapted to process an entire hyperspectral image, and the resulting images were classified using MLC. Wavelet approximation coefficients gave slightly better classification results than wavelet detail coefficients, and the Haar wavelet proved superior for classification purposes. With 6 approximation coefficients, the Haar wavelet could be used to classify the data with an accuracy of 95.6%; with 11 approximation coefficients, this figure increased to 96.1%.

First and second derivative spectra were also used in the classification of the image. The first and second derivatives were determined for each of the class spectral means, and for each band the standard deviations of both derivatives were calculated. Bands were then ranked in order of decreasing standard deviation, the bands showing the highest standard deviations were identified, and the derivatives were generated for the entire image at these wavelengths. The resulting first and second derivative images were then classified using MLC. Using 25 spectral bands, classification accuracies of approximately 96% and 95% were achieved with the first and second derivative images respectively. These results are comparable with those from using wavelets, although wavelets produced higher classification accuracies when fewer coefficients were used.
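The wavelet-compression-plus-MLC pipeline can be sketched on simulated spectra. In the hedged example below, PyWavelets supplies the Haar transform and scikit-learn's QuadraticDiscriminantAnalysis fits one Gaussian mean and covariance per class, i.e. the MLC rule; the band count, class shapes and noise level are assumptions, not the HyMap data.

```python
# Sketch: keep only Haar approximation coefficients, then apply Gaussian
# maximum likelihood classification (QDA = one Gaussian per class).
import numpy as np
import pywt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class, n_bands, n_classes = 100, 64, 3
X, y = [], []
for c in range(n_classes):
    base = np.sin(np.linspace(0, 3 + c, n_bands)) + 0.1 * c  # class spectrum
    X.append(base + 0.15 * rng.normal(size=(n_per_class, n_bands)))
    y.append(np.full(n_per_class, c))
X, y = np.vstack(X), np.concatenate(y)

# Haar approximation coefficients at level 3: 64 bands -> 8 features.
def haar_approx(spectrum, level=3):
    return pywt.wavedec(spectrum, "haar", level=level)[0]

X_w = np.apply_along_axis(haar_approx, 1, X)
clf = QuadraticDiscriminantAnalysis().fit(X_w, y)
print("training accuracy:", clf.score(X_w, y))
```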
|
99 |
Métodos de redução de dimensionalidade aplicados na seleção genômica para características de carcaça em suínos / Dimensionality reduction methods applied to genomic selection for carcass traits in pigs. Azevedo, Camila Ferreira (26 July 2012)
The main contribution of molecular genetics to animal breeding is the direct use of DNA information to identify genetically superior individuals. Under this approach, genome-wide selection (GWS) can be used. GWS consists of analyzing a large number of SNP markers widely distributed across the genome; because the number of markers is much larger than the number of genotyped individuals (high dimensionality) and the markers are highly correlated (multicollinearity), methodologies that address these adversities are fundamental to the success of genome-wide selection. In view of this, the aim of this dissertation was to apply Independent Component Regression (ICR), Principal Component Regression (PCR), Partial Least Squares regression (PLSR) and the Random Regression Best Linear Unbiased Predictor (RR-BLUP) to carcass traits in an F2 pig population originated from the cross of two males of the naturalized Brazilian breed Piau with 18 females of a commercial line (Large White × Landrace × Pietrain), developed at the Universidade Federal de Viçosa. The specific objectives were to estimate the Genomic Breeding Value (GBV) of each individual and to estimate the effects of the SNP markers, in order to compare the methods. The results showed that ICR was the most efficient method, since it provided the most accurate genomic breeding value estimates for most carcass traits.
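A hedged sketch of the three dimension-reduction regressions compared in the dissertation follows, with ICR realized as FastICA followed by ordinary least squares; the marker counts, component numbers, effect sizes and the simulated genotypes are illustrative assumptions, and the in-sample correlation printed at the end is for illustration only, not an accuracy benchmark.

```python
# PCR, ICR and PLSR on simulated SNP data: project the marker matrix onto
# a few components, then regress the phenotype on those components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, k = 300, 2000, 10                    # animals, SNP markers, components
X = rng.integers(0, 3, size=(n, p)).astype(float)   # genotypes coded 0/1/2
beta = np.zeros(p)
beta[rng.choice(p, 20, replace=False)] = rng.normal(size=20)
y = X @ beta + rng.normal(size=n)          # phenotype = marker signal + noise

def fit_component_regression(reducer):
    Z = reducer.fit_transform(X)           # markers -> k latent components
    return LinearRegression().fit(Z, y).predict(Z)

gbv = {
    "PCR": fit_component_regression(PCA(n_components=k)),
    "ICR": fit_component_regression(FastICA(n_components=k, random_state=0)),
    "PLSR": PLSRegression(n_components=k).fit(X, y).predict(X).ravel(),
}
for name, pred in gbv.items():             # in-sample corr(GBV, phenotype)
    print(name, np.corrcoef(pred, y)[0, 1].round(3))
```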
|
100 |
Sparse Bayesian Time-Varying Covariance Estimation in Many Dimensions. Kastner, Gregor (18 September 2016)
Dynamic covariance estimation for multivariate time series suffers from the curse of dimensionality, which renders parsimonious estimation methods essential for conducting reliable statistical inference. In this paper, the issue is addressed by modeling the underlying co-volatility dynamics of a time series vector through a lower-dimensional collection of latent time-varying stochastic factors. Furthermore, we apply a Normal-Gamma prior to the elements of the factor loadings matrix. This hierarchical shrinkage prior effectively pulls the loadings of unimportant factors towards zero, thereby increasing parsimony even further. We apply the model to simulated data as well as daily log-returns of 300 S&P 500 stocks and demonstrate the effectiveness of the shrinkage prior in obtaining sparse loadings matrices and more precise correlation estimates. Moreover, we investigate predictive performance and discuss different choices for the number of latent factors. In addition to being a stand-alone tool, the algorithm is designed to act as a "plug and play" extension for other MCMC samplers; it is implemented in the R package factorstochvol.
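The model structure can be written compactly. The display below is a reconstruction of the standard factor stochastic volatility setup with a Normal-Gamma shrinkage prior on the loadings; the notation and the exact hyperparameterization of the prior are ours, assumed for illustration rather than taken from the paper.

```latex
% Sparse factor stochastic volatility model (reconstruction; notation ours).
\begin{align*}
  \mathbf{y}_t \mid \mathbf{f}_t &\sim \mathcal{N}\!\big(\Lambda \mathbf{f}_t,\;
      \operatorname{diag}(e^{h_{1t}},\dots,e^{h_{mt}})\big), \\
  \mathbf{f}_t &\sim \mathcal{N}\!\big(\mathbf{0},\;
      \operatorname{diag}(e^{h_{m+1,t}},\dots,e^{h_{m+r,t}})\big), \\
  h_{it} &= \mu_i + \phi_i\,(h_{i,t-1} - \mu_i) + \sigma_i \eta_{it},
      \qquad \eta_{it} \sim \mathcal{N}(0,1), \\
  \Lambda_{ij} \mid \tau_{ij}^2 &\sim \mathcal{N}(0,\, \tau_{ij}^2), \qquad
  \tau_{ij}^2 \sim \mathcal{G}\big(a,\; a\lambda^2/2\big)
  \quad \text{(Normal--Gamma shrinkage)}.
\end{align*}
```

Here each of the m observed series and r latent factors carries its own stochastic log-variance h following an AR(1) process, and the Gamma mixing over the loading variances produces the heavy-tailed, zero-pulling shrinkage described in the abstract.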
|