81 |
Extração de tópicos baseado em agrupamento de regras de associação / Topic extraction based on association rule clustering. Fabiano Fernandes dos Santos, 29 May 2015 (has links)
Uma representação estruturada dos documentos em um formato apropriado para a obtenção automática de conhecimento, sem que haja perda de informações relevantes em relação ao formato originalmente não-estruturado, é um dos passos mais importantes da mineração de textos, pois a qualidade dos resultados obtidos com as abordagens automáticas para obtenção de conhecimento de textos está fortemente relacionada à qualidade dos atributos utilizados para representar a coleção de documentos. O Modelo de Espaço de Vetores (MEV) é um modelo tradicional para obter uma representação estruturada dos documentos. Neste modelo, cada documento é representado por um vetor de pesos correspondentes aos atributos do texto. O modelo bag-of-words é a abordagem de MEV mais utilizada devido à sua simplicidade e aplicabilidade. Entretanto, o modelo bag-of-words não trata a dependência entre termos e possui alta dimensionalidade. Diversos modelos para representação dos documentos foram propostos na literatura visando capturar a informação de relação entre termos, destacando-se os modelos baseados em frases ou termos compostos, o Modelo de Espaço de Vetores Generalizado (MEVG) e suas extensões, modelos de tópicos não-probabilísticos, como o Latent Semantic Analysis (LSA) ou o Non-negative Matrix Factorization (NMF), e modelos de tópicos probabilísticos, como o Latent Dirichlet Allocation (LDA) e suas extensões. A representação baseada em modelos de tópicos é uma das abordagens mais interessantes uma vez que ela fornece uma estrutura que descreve a coleção de documentos em uma forma que revela sua estrutura interna e as suas inter-relações. As abordagens de extração de tópicos também fornecem uma estratégia de redução da dimensionalidade visando a construção de novas dimensões que representam os principais tópicos ou assuntos identificados na coleção de documentos. Entretanto, a extração eficiente de informações sobre as relações entre os termos para a construção da representação de documentos ainda é um grande desafio de pesquisa. Os modelos para representação de documentos que exploram a correlação entre termos normalmente enfrentam um grande desafio para manter um bom equilíbrio entre (i) a quantidade de dimensões obtidas, (ii) o esforço computacional e (iii) a interpretabilidade das novas dimensões obtidas. Assim, é proposto neste trabalho o modelo para representação de documentos Latent Association Rule Cluster based Model (LARCM). Este é um modelo de extração de tópicos não-probabilístico que explora o agrupamento de regras de associação para construir uma representação da coleção de documentos com dimensionalidade reduzida tal que as novas dimensões são extraídas a partir das informações sobre as relações entre os termos. No modelo proposto, as regras de associação são extraídas para cada documento para obter termos correlacionados que formam expressões multi-palavras. Essas relações entre os termos formam o contexto local da relação entre termos. Em seguida, aplica-se um processo de agrupamento em todas as regras de associação para formar o contexto geral das relações entre os termos, e cada grupo de regras de associação obtido formará um tópico, ou seja, uma dimensão da representação. Também é proposta neste trabalho uma metodologia de avaliação que permite selecionar modelos que maximizam tanto os resultados na tarefa de classificação de textos quanto os resultados de interpretabilidade dos tópicos obtidos.
O modelo LARCM foi comparado com o modelo LDA tradicional e com o modelo LDA utilizando uma representação que inclui termos compostos (bag-of-related-words). Os resultados dos experimentos indicam que o modelo LARCM produz uma representação para os documentos que contribui significativamente para a melhora dos resultados na tarefa de classificação de textos, mantendo também uma boa interpretabilidade dos tópicos obtidos. O modelo LARCM também apresentou ótimo desempenho quando utilizado para extração de informação de contexto para aplicação em sistemas de recomendação sensíveis ao contexto. / A structured representation of documents in an appropriate format for automatic knowledge extraction, without loss of relevant information, is one of the most important steps of text mining, since the quality of the results obtained with automatic approaches for text knowledge extraction is strongly related to the quality of the attributes selected to represent the collection of documents. The Vector Space Model (VSM) is a traditional structured representation of documents. In this model, each document is represented as a vector of weights that correspond to the features of the document. The bag-of-words model is the most popular VSM approach because of its simplicity and general applicability. However, the bag-of-words model does not capture dependencies among terms and has high dimensionality. Several models for document representation have been proposed in the literature in order to capture the dependence among terms, especially models based on phrases or compound terms, the Generalized Vector Space Model (GVSM) and its extensions, non-probabilistic topic models such as Latent Semantic Analysis (LSA) or Non-negative Matrix Factorization (NMF), and probabilistic topic models such as Latent Dirichlet Allocation (LDA) and its extensions. The topic model representation is one of the most interesting approaches since it provides a structure that describes the collection of documents in a way that reveals its internal structure and interrelationships. This approach also provides a dimensionality reduction strategy, aiming to build new dimensions that represent the main topics or ideas of the document collection. However, the efficient extraction of information about the relations among terms for document representation is still a major research challenge. Document representation models that explore correlated terms usually face the challenge of keeping a good balance among (i) the number of extracted features, (ii) the computational performance and (iii) the interpretability of the new features. Thus, this work proposes the Latent Association Rule Cluster based Model (LARCM). The LARCM is a non-probabilistic topic model that explores association rule clustering to build a document representation with low dimensionality, in such a way that each dimension is composed of information about the relations among the terms. In the proposed approach, association rules are extracted for each document to obtain the correlated terms that compose multi-word expressions. These relations among the terms form the local context of term relations. Then, a clustering process is applied to all association rules to discover the general context of the relations, and each obtained cluster is an extracted topic, i.e., a dimension of the new document representation.
This work also proposes an evaluation methodology to select topic models that maximize both the results in the text classification task and the interpretability of the obtained topics. The LARCM model was compared against both the traditional LDA model and the LDA model using a document representation that includes multi-word expressions (bag-of-related-words). The experimental results indicate that LARCM provides a document representation that significantly improves the results in the text classification task while retaining good interpretability of the extracted topics. The LARCM model also achieved good results as a method to extract contextual information for context-aware recommender systems.
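A minimal sketch of the local-rules-then-global-clustering idea described in the abstract, assuming toy documents represented as term sets, simple two-item rules filtered by a support threshold, and Jaccard-based agglomerative clustering; these choices are illustrative assumptions, not the authors' LARCM implementation:

```python
from itertools import combinations
from collections import Counter

import numpy as np
from sklearn.cluster import AgglomerativeClustering

docs = [
    {"topic", "model", "text", "mining"},          # toy documents as term sets
    {"topic", "model", "cluster", "rule"},
    {"association", "rule", "cluster", "mining"},
]

# 1) Local context: frequent term pairs inside each document act as simple
#    two-item association rules (a real apriori step would also use confidence).
pair_counts = Counter()
for terms in docs:
    for a, b in combinations(sorted(terms), 2):
        pair_counts[(a, b)] += 1
rules = [pair for pair, c in pair_counts.items() if c >= 2]   # min support = 2 docs

# 2) Global context: cluster rules by Jaccard distance between their term sets.
def jaccard_dist(r1, r2):
    s1, s2 = set(r1), set(r2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)

dist = np.array([[jaccard_dist(r1, r2) for r2 in rules] for r1 in rules])
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"   # sklearn >= 1.2; older releases call this parameter affinity
).fit_predict(dist)

# 3) Each rule cluster becomes one topic/dimension; documents are scored by
#    how many of the cluster's rules they contain.
n_topics = labels.max() + 1
doc_repr = np.zeros((len(docs), n_topics))
for r, lab in zip(rules, labels):
    for d, terms in enumerate(docs):
        if set(r) <= terms:
            doc_repr[d, lab] += 1
print(doc_repr)
```

Each rule cluster becomes one dimension, so the resulting matrix is far smaller than a bag-of-words representation while each dimension stays readable as a group of related term pairs.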
|
82 |
Projective geometry, toric algebra and tropical computations. Görlach, Paul, 04 December 2020 (has links)
No description available.
|
83 |
Efficient Inversion of Large-Scale Problems Exploiting Structure and Randomization. January 2020 (has links)
abstract: Dimensionality reduction methods are examined for large-scale discrete problems, specifically for the solution of three-dimensional geophysics problems: the inversion of gravity and magnetic data. The matrices for the associated forward problems have beneficial structure for each depth layer of the volume domain, under mild assumptions, which facilitates the use of the two dimensional fast Fourier transform for evaluating forward and transpose matrix operations, providing considerable savings in both computational costs and storage requirements. Application of this approach for the magnetic problem is new in the geophysics literature. Further, the approach is extended for padded volume domains.
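A hedged sketch of the FFT idea described above, assuming each depth layer's sensitivity matrix acts as a 2-D convolution with a translation-invariant kernel; the kernel, grid size and random test data are illustrative, not the geophysical kernels used in the dissertation:

```python
import numpy as np

ny, nx = 64, 64
rng = np.random.default_rng(0)
kernel = rng.standard_normal((ny, nx))   # stand-in for one depth layer's point response
model = rng.standard_normal((ny, nx))    # density/susceptibility values on the grid
py, px = 2 * ny, 2 * nx                  # zero padding avoids circular wrap-around

def forward(layer_model):
    """Forward matvec: 2-D linear convolution via padded FFTs, cropped to the data grid."""
    K = np.fft.rfft2(kernel, s=(py, px))
    M = np.fft.rfft2(layer_model, s=(py, px))
    return np.fft.irfft2(K * M, s=(py, px))[:ny, :nx]

def transpose(data):
    """Transpose matvec: correlation with the kernel, i.e. convolution with the flipped kernel."""
    Kf = np.fft.rfft2(kernel[::-1, ::-1], s=(py, px))
    D = np.fft.rfft2(data, s=(py, px))
    return np.fft.irfft2(Kf * D, s=(py, px))[ny - 1:2 * ny - 1, nx - 1:2 * nx - 1]

# Adjoint consistency: <forward(m), y> should equal <m, transpose(y)>.
y = rng.standard_normal((ny, nx))
print(np.allclose(np.vdot(forward(model), y), np.vdot(model, transpose(y))))
```

The adjoint check at the end confirms that the cropped FFT correlation really implements the transpose of the cropped FFT convolution, which is what allows both operations to run without ever forming the dense matrix.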
Stabilized inversion is obtained efficiently by applying novel randomization techniques within each update of the iteratively reweighted scheme. For a general rectangular linear system, a randomization technique combined with preconditioning is introduced and investigated. This is shown to provide well-conditioned inversion, stabilized through truncation. Applying this approach, while implementing matrix operations using the two dimensional fast Fourier transform, yields computationally effective inversion, in memory and cost. Validation is provided via synthetic data sets, and the approach is contrasted with the well-known LSRN algorithm when applied to these data sets. The results demonstrate a significant reduction in computational cost with the new algorithm. Further, this new algorithm produces results for inversion of real magnetic data consistent with those provided in literature.
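A sketch of the sketch-and-precondition idea this paragraph refers to, in the spirit of LSRN but with illustrative sizes, thresholds and a dense random test matrix rather than the dissertation's algorithm:

```python
import numpy as np
from scipy.sparse.linalg import lsqr, LinearOperator

rng = np.random.default_rng(1)
m, n = 2000, 100
A = rng.standard_normal((m, n)) @ np.diag(np.logspace(0, -6, n))  # ill-conditioned test matrix
b = A @ rng.standard_normal(n) + 1e-3 * rng.standard_normal(m)

# 1) Sketch the column space with a small Gaussian projection.
s = 2 * n
GA = rng.standard_normal((s, m)) @ A                 # s x n sketch of A

# 2) Build a right preconditioner from the sketch's SVD (truncation stabilizes).
U, sig, Vt = np.linalg.svd(GA, full_matrices=False)
keep = sig > 1e-10 * sig[0]                          # drop tiny singular values
N = Vt[keep].T / sig[keep]                           # n x r right preconditioner

# 3) Solve the well-conditioned system min ||A N y - b|| with LSQR, then map back.
AN = LinearOperator((m, N.shape[1]),
                    matvec=lambda y: A @ (N @ y),
                    rmatvec=lambda r: N.T @ (A.T @ r))
y = lsqr(AN, b, atol=1e-10, btol=1e-10)[0]
x = N @ y
print(np.linalg.norm(A @ x - b))
```

Truncating the small singular values of the sketch is what stabilizes the preconditioned system, so LSQR converges in few iterations even though A itself is badly conditioned.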
Typically, the iteratively reweighted least squares algorithm depends on a standard Tikhonov formulation. Here, this is solved using both a randomized singular value decomposition and the iterative LSQR Krylov algorithm. The results demonstrate that the new algorithm is competitive with these approaches and offers the advantage that no regularization parameter needs to be found at each outer iteration.
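A hedged sketch of one way such an iteratively reweighted scheme with a Tikhonov subproblem can be solved by a truncated randomized SVD; the regularization parameter, weighting rule, rank and test problem are all illustrative assumptions:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(2)
m, n, k = 500, 200, 50
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 10, replace=False)] = 1.0       # sparse test model
b = A @ x_true + 0.01 * rng.standard_normal(m)

lam, eps = 1.0, 1e-4
x = np.zeros(n)
for _ in range(10):                                  # outer IRLS iterations
    w = 1.0 / np.sqrt(np.abs(x) + eps)               # reweighting towards an L1-type penalty
    Aw = A / w                                       # A @ diag(1/w), change of variables z = w * x
    U, s, Vt = randomized_svd(Aw, n_components=k, random_state=0)
    # Tikhonov filter factors on the truncated spectrum: s / (s^2 + lam^2)
    z = Vt.T @ ((s / (s**2 + lam**2)) * (U.T @ b))
    x = z / w                                        # undo the change of variables
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```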
Given its efficiency, investigating the new algorithm for the joint inversion of these data sets may be fruitful. Initial research on joint inversion using the two dimensional fast Fourier transform has recently been submitted and provides the basis for future work. Several alternative directions for dimensionality reduction are also discussed, including iteratively applying an approximate pseudo-inverse and obtaining an approximate Kronecker product decomposition via randomization for a general matrix. These are also topics for future consideration. / Dissertation/Thesis / Doctoral Dissertation Applied Mathematics 2020
|
84 |
Construction and Visualization of Semantic Spaces for Domain-Specific Text Corpora. Choudhary, Rishabh R., 04 October 2021 (has links)
No description available.
|
85 |
Data mining / Data mining. Mrázek, Michal, January 2019 (has links)
The aim of this master’s thesis is the analysis of multidimensional data. Three dimensionality reduction algorithms are introduced. It is shown how to manipulate text documents using basic methods of natural language processing. The goal of the practical part of the thesis is to process real-world data from an internet forum. Posted messages are transformed into a numerical representation, then reduced to a two-dimensional space and visualized. Later on, topics of the messages are discovered. In the last part, a few selected algorithms are compared.
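An illustrative pipeline for the steps described above (numerical representation, reduction to two dimensions, topic discovery), using a toy corpus and standard scikit-learn components rather than the thesis's actual data and algorithm choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

posts = [
    "how to install the driver on linux",
    "driver crashes after the latest update",
    "best settings for photo editing",
    "photo colours look washed out after editing",
]

# Numerical representation: TF-IDF bag of words.
X = TfidfVectorizer(stop_words="english").fit_transform(posts)

# Dimensionality reduction to two dimensions for plotting.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(coords)

# Topic discovery: NMF components act as topics over the vocabulary.
nmf = NMF(n_components=2, random_state=0, init="nndsvd").fit(X)
print(nmf.components_.shape)   # (n_topics, n_terms)
```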
|
86 |
Sample-Efficient Reinforcement Learning of Robot Control Policies in the Real World. January 2019 (has links)
abstract: The goal of reinforcement learning is to enable systems to autonomously solve tasks in the real world, even in the absence of prior data. To succeed in such situations, reinforcement learning algorithms collect new experience through interactions with the environment to further the learning process. The behaviour is optimized by maximizing a reward function, which assigns high numerical values to desired behaviours. Especially in robotics, such interactions with the environment are expensive in terms of the required execution time, human involvement, and mechanical degradation of the system itself. Therefore, this thesis aims to introduce sample-efficient reinforcement learning methods which are applicable to real-world settings and control tasks such as bimanual manipulation and locomotion. Sample efficiency is achieved through directed exploration, either by using dimensionality reduction or trajectory optimization methods. Finally, it is demonstrated how data-efficient reinforcement learning methods can be used to optimize the behaviour and morphology of robots at the same time. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2019
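A toy sketch of the core loop the abstract describes, collecting experience and updating behaviour to maximize a reward function; the one-step task, Gaussian policy and score-function update are illustrative stand-ins, not the thesis's robot-learning methods:

```python
import numpy as np

rng = np.random.default_rng(3)
target = 2.0                                  # rewarding action, unknown to the learner
mu, sigma, lr = 0.0, 0.5, 0.05                # policy mean, fixed std, learning rate

for step in range(500):
    actions = mu + sigma * rng.standard_normal(16)       # sample a small batch of behaviours
    rewards = -(actions - target) ** 2                   # higher reward for desired behaviour
    baseline = rewards.mean()                            # simple variance reduction
    # grad of log pi(a) w.r.t. mu for a Gaussian policy is (a - mu) / sigma^2
    grad = np.mean((rewards - baseline) * (actions - mu) / sigma**2)
    mu += lr * grad                                      # gradient ascent on expected reward

print(round(mu, 2))   # the policy mean approaches the rewarding action ~2.0
```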
|
87 |
Automatic Generation of Descriptive Features for Predicting Vehicle Faults. Revanur, Vandan; Ayibiowu, Ayodeji, January 2020 (has links)
Predictive Maintenance (PM) has been increasingly adopted in the automotive industry in recent decades, alongside conventional approaches such as Preventive Maintenance and Diagnostic/Corrective Maintenance, since it provides many advantages: it can estimate a failure proactively, before it actually occurs, and it adapts to the present status of the vehicle, in turn allowing flexible maintenance schedules for efficient repair or replacement of faulty components. PM necessitates the storage and analysis of large amounts of sensor data. This requirement can be a challenge when deploying this method on board the vehicle due to the limited storage and computational power of the vehicle's hardware. Hence, this thesis seeks to obtain low-dimensional descriptive features from high-dimensional data using Representation Learning. This low-dimensional representation will be used for predicting vehicle faults, specifically Turbocharger-related failures. Since the Logged Vehicle Data (LVD) formed the basis of all the data utilized in this thesis, it allowed for the evaluation of large populations of trucks without requiring additional measuring devices and facilities. A gradual degradation methodology is considered for describing vehicle condition, which allows for modeling the malfunction/failure as a continuous process rather than a discrete flip from a healthy to an unhealthy state. This approach eliminates the challenge of data imbalance between healthy and unhealthy samples. Two important hypotheses are presented. Firstly, Parallel Stacked Classical Autoencoders would produce better representations compared to individual Autoencoders. Secondly, employing Learned Embeddings on Categorical Variables would improve the performance of the dimensionality reduction. Based on these hypotheses, a model architecture is proposed and developed on the LVD. The model is shown to achieve good performance, close to that of previous state-of-the-art research. Finally, this thesis illustrates the potential of applying parallel stacked architectures with Learned Embeddings for the categorical features, and a combination of feature selection and extraction for the numerical features, to predict the Remaining Useful Life (RUL) of a vehicle in the context of the Turbocharger. A performance improvement of 21.68% with respect to the Mean Absolute Error (MAE) loss was observed, together with an 80.42% reduction in the size of the data.
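A hedged sketch of the learned-embedding-plus-autoencoder idea, with one categorical field, toy dimensions and random data in place of the LVD; the parallel stacked architecture itself is not reproduced here:

```python
import torch
import torch.nn as nn

n_numeric, n_categories, emb_dim, latent_dim = 32, 10, 4, 8

class MixedAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_categories, emb_dim)      # learned categorical embedding
        self.encoder = nn.Sequential(
            nn.Linear(n_numeric + emb_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),                         # low-dimensional code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_numeric),                          # reconstruct the numeric part
        )

    def forward(self, x_num, x_cat):
        z = self.encoder(torch.cat([x_num, self.embed(x_cat)], dim=1))
        return self.decoder(z), z

model = MixedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_num = torch.randn(256, n_numeric)                           # stand-in sensor batch
x_cat = torch.randint(0, n_categories, (256,))                # stand-in categorical codes

for _ in range(5):                                            # a few training steps
    recon, code = model(x_num, x_cat)
    loss = nn.functional.mse_loss(recon, x_num)               # reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()

print(code.shape)   # (256, 8): the compact descriptive features
```

The low-dimensional code returned by the encoder plays the role of the descriptive features that would feed a downstream RUL predictor.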
|
88 |
Improving Support-vector machines with Hyperplane folding. Söyseth, Carl; Ekelund, Gustav, January 2019 (has links)
Background. Hyperplane folding was introduced by Lars Lundberg et al. Hyperplane folding increased the margin while suffering from a flaw, referred to as over-rotation in this thesis. The aim of this thesis is to introduce a new technique that would not over-rotate data points. This novel technique is referred to as Rubber Band folding in the thesis. The following research questions are addressed: 1) Does Rubber Band folding increase classification accuracy? 2) Does Rubber Band folding increase the margin? 3) How does Rubber Band folding affect execution time? Rubber Band folding was implemented and its results were compared to Hyperplane folding and the Support-vector machine. This comparison was done by applying stratified ten-fold cross-validation on four data sets for research questions 1 and 2. Four folds were applied for both Hyperplane folding and Rubber Band folding, as more folds can lead to over-fitting, while research question 3 used 15 folds in order to see trends, since it is not affected by over-fitting. One BMI data set was artificially made for the initial Hyperplane folding paper. Another data set labeled patients with or without a liver disorder. Another data set predicted whether patients have benign or malignant cancer cells. Finally, a data set predicted whether a hepatitis patient is alive within five years. Results. Rubber Band folding achieved a higher classification accuracy than Hyperplane folding in all data sets. Rubber Band folding increased the classification accuracy in the BMI and cancer data sets, while its accuracy decreased in the liver and hepatitis data sets. Hyperplane folding's accuracy decreased in all data sets. Both Rubber Band folding and Hyperplane folding increase the margin for all data sets tested. Rubber Band folding achieved a higher margin than Hyperplane folding in the BMI and liver data sets. Execution time for both the classification of data points and the training of the classifier increases linearly per fold. Rubber Band folding has slower growth in classification time than Hyperplane folding. Rubber Band folding can increase classification accuracy; the exact cases in which this happens are unknown, but it is believed to occur when the data is not linearly separable. Rubber Band folding increases the margin. Compared to Hyperplane folding, Rubber Band folding can in some cases achieve a larger increase in margin, while in other cases Hyperplane folding achieves a higher margin. Both Hyperplane folding and Rubber Band folding increase training time and classification time linearly. The difference in training time between Hyperplane folding and Rubber Band folding was negligible, while Rubber Band folding's increase in classification time was lower. This was attributed to Rubber Band folding rotating fewer points after 15 folds.
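For reference, a baseline sketch of the evaluation protocol described above: a standard support-vector machine under stratified ten-fold cross-validation on a stand-in cancer data set. The Hyperplane folding and Rubber Band folding steps themselves are not reproduced here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)          # stand-in for the cancer data set
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())
```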
|
89 |
Dimensionality Reduction in Healthcare Data Analysis on Cloud Platform. Ray, Sujan, January 2020 (has links)
No description available.
|
90 |
Reduced and coded sensing methods for x-ray based security. Sun, Zachary Z., 05 November 2016 (has links)
Current x-ray technologies provide security personnel with non-invasive sub-surface imaging and contraband detection in various portal screening applications such as checked and carry-on baggage as well as cargo. Computed tomography (CT) scanners generate detailed 3D imagery in checked bags; however, these scanners often come with significant power, cost, and space requirements. These tomography machines are impractical for many applications where space and power are often limited, such as checkpoint areas. Reducing the amount of data acquired would help reduce the physical demands of these systems. Unfortunately, this leads to the formation of artifacts in various applications, thus presenting significant challenges in reconstruction and classification. As a result, the goal is to maintain a certain level of image quality while reducing the amount of data gathered. For the security domain this would allow for faster and cheaper screening in existing systems or allow for previously infeasible screening options due to other operational constraints. While our focus is predominantly on security applications, many of the techniques can be extended to other fields such as the medical domain, where a reduction of dose can allow for safer and more frequent examinations.
This dissertation aims to advance data reduction algorithms for security-motivated x-ray imaging in three main areas: (i) development of a sensing-aware dimensionality reduction framework, (ii) creation of a linear-motion tomographic method of object scanning and associated reconstruction algorithms for carry-on baggage screening, and (iii) the application of coded aperture techniques to improve and extend imaging performance of nuclear resonance fluorescence in cargo screening. The sensing-aware dimensionality reduction framework extends existing dimensionality reduction methods to include knowledge of an underlying sensing mechanism of a latent variable. This method provides an improved classification rate over classical methods on both a synthetic case and a popular face classification dataset. The linear tomographic method is based on non-rotational scanning of baggage moved by a conveyor belt, and can thus be simpler, smaller, and more reliable than existing rotational tomography systems, at the expense of more challenging image formation problems that require special model-based methods. The reconstructions for this approach are comparable to existing tomographic systems. Finally, our coded aperture extension of existing nuclear resonance fluorescence cargo scanning provides improved observation signal-to-noise ratios. We analyze, discuss, and demonstrate the strengths and challenges of using coded aperture techniques in this application and provide guidance on regimes where these methods can yield gains over conventional methods.
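A hedged sketch of the model-based reconstruction theme: recovering an image from fewer measurements than unknowns by solving a regularized least-squares problem. The random measurement matrix and damping value are illustrative stand-ins for the dissertation's linear-motion system model and algorithms:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(4)
n_side = 16
n_pix = n_side * n_side
n_meas = n_pix // 2                                   # reduced data: half as many measurements

x_true = np.zeros(n_pix)
x_true[100:140] = 1.0                                 # simple block "object"

A = rng.standard_normal((n_meas, n_pix)) / np.sqrt(n_meas)   # stand-in system matrix
y = A @ x_true + 0.01 * rng.standard_normal(n_meas)

# Damped (Tikhonov-regularized) least squares keeps the underdetermined problem stable:
# minimize ||A x - y||^2 + damp^2 ||x||^2
x_rec = lsqr(A, y, damp=0.1)[0]
print(np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```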
|