Global ETD Search

1	Category-theoretic quantitative compositional distributional models of natural language semantics Grefenstette, Edward Thomas January 2013 (has links) This thesis is about the problem of compositionality in distributional semantics. Distributional semantics presupposes that the meanings of words are a function of their occurrences in textual contexts. It models words as distributions over these contexts and represents them as vectors in high dimensional spaces. The problem of compositionality for such models concerns itself with how to produce distributional representations for larger units of text (such as a verb and its arguments) by composing the distributional representations of smaller units of text (such as individual words). This thesis focuses on a particular approach to this compositionality problem, namely using the categorical framework developed by Coecke, Sadrzadeh, and Clark, which combines syntactic analysis formalisms with distributional semantic representations of meaning to produce syntactically motivated composition operations. This thesis shows how this approach can be theoretically extended and practically implemented to produce concrete compositional distributional models of natural language semantics. It furthermore demonstrates that such models can perform on par with, or better than, other competing approaches in the field of natural language processing. There are three principal contributions to computational linguistics in this thesis. The first is to extend the DisCoCat framework on the syntactic front and semantic front, incorporating a number of syntactic analysis formalisms and providing learning procedures allowing for the generation of concrete compositional distributional models. The second contribution is to evaluate the models developed from the procedures presented here, showing that they outperform other compositional distributional models present in the literature. The third contribution is to show how using category theory to solve linguistic problems forms a sound basis for research, illustrated by examples of work on this topic, that also suggest directions for future research. 006.3
2	Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing Gorrell, Genevieve January 2006 (has links) The current surge of interest in search and comparison tasks in natural language processing has brought with it a focus on vector space approaches and vector space dimensionality reduction techniques. Presenting data as points in hyperspace provides opportunities to use a variety of welldeveloped tools pertinent to this representation. Dimensionality reduction allows data to be compressed and generalised. Eigen decomposition and related algorithms are one category of approaches to dimensionality reduction, providing a principled way to reduce data dimensionality that has time and again shown itself capable of enabling access to powerful generalisations in the data. Issues with the approach, however, include computational complexity and limitations on the size of dataset that can reasonably be processed in this way. Large datasets are a persistent feature of natural language processing tasks. This thesis focuses on two main questions. Firstly, in what ways can eigen decomposition and related techniques be extended to larger datasets? Secondly, this having been achieved, of what value is the resulting approach to information retrieval and to statistical language modelling at the ngram level? The applicability of eigen decomposition is shown to be extendable through the use of an extant algorithm; the Generalized Hebbian Algorithm (GHA), and the novel extension of this algorithm to paired data; the Asymmetric Generalized Hebbian Algorithm (AGHA). Several original extensions to the these algorithms are also presented, improving their applicability in various domains. The applicability of GHA to Latent Semantic Analysisstyle tasks is investigated. Finally, AGHA is used to investigate the value of singular value decomposition, an eigen decomposition variant, to ngram language modelling. A sizeable perplexity reduction is demonstrated. Generalized Hebbian Algorithm Language Modelling Singular Value Decomposition Eigen Decomposition Latent Semantic Analysis Vector Space Models Computational linguistics Datorlingvistik
3	Φασματικές μέθοδοι ανάκτησης πληροφορίας, εργαλεία λογισμικού και εφαρμογές Ζεϊμπέκης, Δημήτριος 20 October 2009 (has links) Η διαρκώς αυξανόμενη διαθεσιμότητα ηλεκτρονικών πηγών πληροφόρησης έχει δημιουργήσει νέα δεδομένα και απαιτήσεις στην περιοχή της Ανάκτησης Πληροφορίας. Υπάρχει αδιάκοπη ανάγκη για βελτίωση των υπαρχόντων και σχεδίαση νέων αλγορίθμων, που να επιτυγχάνουν υψηλή απόδοση και αξιοπιστία. Ένα επιπλέον ζητούμενο είναι η κατασκευή λογισμικού περιβάλλοντος που θα διευκολύνει τη χρήση υπαρχόντων αλγορίθμων, την εισαγωγή νέων, το συνδυασμό τους και τη συγκριτική αξιολόγησή τους. Στην παρούσα διδακτορική διατριβή, εστιάζουμε σε μεθόδους ανάκτησης πληροφορίας (με έμφαση στην ανάκτηση κειμένου), που έχουν στον πυρήνα τους τεχνολογίες Γραμμικής Άλγεβρας και πιο συγκεκριμένα σε τεχνικές που αξιοποιούν τα φασματικά χαρακτηριστικά των μητρώων όρων-κειμένων. Υπενθυμίζουμε ότι περίοπτη θέση στην περιοχή της Ανάκτησης Πληροφορίας, όσον αφορά τις τεχνικές της γραμμικής άλγεβρας, κατέχουν οι ιδιάζουσες τιμές και τα ιδιάζοντα διανύσματα των μητρώων. Περιγράφουμε επίσης το σχεδιασμό και την κατασκευή ενός ολοκληρωμένου περιβάλλοντος που διευκολύνει τους χρήστες στην ανάπτυξη, χρήση και αξιολόγηση των αλγορίθμων που στηρίζεται στο εξαιρετικά διαδεδομένο περιβάλλον της MATLAB. Αρχικά, εξετάζουμε τα βασικά προβλήματα στην Ανάκτηση Πληροφορίας, που είναι η ομαδοποίηση, η εξαγωγή σχετικών κειμένων και η κατηγοριοποίηση. Στην πρώτη κατηγορία προβλημάτων, στόχος μας είναι η βελτίωση παραδοσιακών αλγορίθμων όπως οι k-means και PDDP. Στο πλαίσιο αυτό προτείνουμε ένα σύνολο υβριδικών τεχνικών που βασίζονται στους δύο αυτούς αλγορίθμους και αντιμετωπίζουν προβλήματα που σχετίζονται με αυτούς. Ειδικότερα, πετυχαίνουν τη βελτίωση της απόδοσής τους ως προς την ποιότητα των παρεχόμενων αποτελεσμάτων ή ως προς την ταχύτητά τους. Σε σύγκριση με τον k-means, επιτυγχάνουν την αφαίρεση του στοιχείου της τυχαιότητας που τον χαρακτηρίζει, λόγω της γνωστής ευαισθησίας του στις αρχικές συνθήκες. Επιπλέον, προτείνουμε ένα ενιαίο σύνολο αποδοτικών "μεθόδων πυρήνα" (kernel methods) που μπορούν να χρησιμοποιηθούν στην περίπτωση που τα δεδομένα του προβλήματος έχουν μη γραμμικά χαρακτηριστικά. Οι παραπάνω υβριδικές μέθοδοι εφαρμόζονται και στο πρόβλημα της μπλοκ διαγωνιοποίησης στοχαστικών μητρώων που μοντελοποιούν για παράδειγμα χημικές διεργασίες, μέσω μαρκοβιανών αλυσίδων. Τα αρχικά αποτελέσματα που έχουμε, υποδεικνύουν ότι η προσέγγιση αυτή μπορεί να βελτιώσει σημαντικά υπάρχουσες μεθόδους, παρέχοντας ταυτόχρονα προσεγγίσεις του πλήθους των μπλοκ που αντιστοιχούν σε σταθερές καταστάσεις της μαρκοβιανής αλυσίδας. Τέλος, προτείνουμε μια διαφορετική προσέγγιση με τον αλγόριθμο ομαδοποίησης Oriented k-windows ο οποίος, όπως και ο PDDP, χρησιμοποιεί ιδιάζοντα διανύσματα (ισοδύναμα, κύριους άξονες - PCA) με σκοπό την εξαγωγή πληροφορίας αναφορικά με τον κυρίαρχο προσανατολισμό των ομάδων στον Ευκλείδειο χώρο. Στη συνέχεια, παρουσιάζουμε αλγορίθμους ανάκτησης σχετικών κειμένων και αλγορίθμους κατηγοριοποίησης που βασίζονται στη "Λανθάνουσα Σημασιολογική Δεικτοδότηση" (LSI). Πιο συγκεκριμένα, παρουσιάζουμε ένα αλγοριθμικό πλαίσιο που στηρίζεται σε μια "μεθοδολογία αντιπροσώπων", με την οποία προσπαθούμε να προσεγγίσουμε σημασιολογικά μια συλλογή, εξάγοντας υποχώρους του χώρου στηλών του μητρώου όρων-κειμένων που προσεγγίζουν τον βέλτιστο υποχώρο της διάσπασης ιδιαζουσών τιμών. Η μεθοδολογία μας χρησιμοποιεί αλγορίθμους ομαδοποίησης, όπως οι υβριδικές μέθοδοι που αναφέραμε, με σκοπό τη διάσπαση του προβλήματος σε ένα σύνολο όσο γίνεται περισσότερο ανεξάρτητων προβλημάτων που μπορούν να λυθούν περισσότερο αποδοτικά. Μέσα από μια εκτεταμένη πειραματική μελέτη, δείχνουμε ότι η συγκεκριμένη μεθοδολογία μπορεί να βελτιώσει άλλες διαδεδομένες προσεγγίσεις (LSI, LLSF κ.λπ.). Επίσης, επεκτείνουμε και εφαρμόζουμε τη "μεθοδολογία αντιπροσώπων" σε μεθόδους πυρήνα, καθώς επίσης και στο πρόβλημα υπολογισμού μη αρνητικών παραγοντοποίησεων μητρώων (NMF). Δείχνουμε ότι η χρήση της μεθοδολογίας επιφέρει σημαντική μείωση του κόστους σε μνήμη και υπολογισμούς των μεθόδων πυρήνα και βελτίωση της ποιότητας των αποτελεσμάτων της NMF. Η διατριβή στάθηκε αφορμή για την ανάπτυξη ενός ολοκληρωμένου λογισμικού περιβάλλοντος. Πιο συγκεκριμένα, οι νέες μέθοδοι που αναφέραμε, καθώς και άλλες διαδεδομένες τεχνικές έχουν υλοποιηθεί και ενταχθεί στο περιβάλλον Text to Matrix Generator (TMG). Το TMG στηρίζεται κατά κύριο λόγο στη MATLAB ενώ μικρότερα τμήματά του έχουν γραφτεί σε Perl. Το TMG αποτελείται από έξι τμήματα, ενώ είναι εύκολα επεκτάσιμο. Τα τμήματα αυτά παρέχουν μια ευρεία συλλογή μεθόδων ανάκτησης πληροφορίας που αποτελείται από μεθόδους (i) κατασκευής και ανανέωσης μητρώων όρων-κειμένων, (ii) υπολογισμού προσεγγίσεων μειωμένης διάστασης και (iii) μη αρνητικών παραγοντοποιήσεων, (iv) ανάκτησης σχετικών κειμένων, (v) ομαδοποίησης και (vi) κατηγοριοποίησης. Για όλα τα παραπάνω, το εργαλείο παρέχει κατάλληλα προσαρμοσμένες γραφικές διεπαφές που διευκολύνουν το χρήστη. Εναλλακτικά, οι λειτουργίες του μπορούν να κληθούν απευθείας από τη γραμμή εντολών. Το TMG διευκολύνει την ταχεία προτοτυποποίηση αλγορίθμων και διατίθεται ελεύθερα μέσω ιστοσελίδας (http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/). Από αναζητήσεις τεκμηριώνεται ότι έχει υποστηρίξει πολλούς επιστήμονες παγκοσμίως τόσο σε ερευνητικό όσο και σε εκπαιδευτικό επίπεδο. Περιγράφουμε επίσης τις πρόσφατες εργασίες μας για την ανάδειξη του TMG ως υπηρεσίας στον Παγκόσμιο Ιστό. Ειδικότερα, αναπτύσσεται λογισμικό για την απομακρυσμένη χρήση του TMG μέσω ειδικού API και τίθενται οι βάσεις για μελλοντική έρευνα που θα αφορά στην βελτιωμένη επίδοση και στην αποδοτική χρήση του συστήματος. / The amount of digital data is rapidly growing and continuously motivates research innovation in Information Retrieval. Much of the data is text, so there is an ever present need to push the field of Text Mining forward by designing and implementing novel, effective algorithms that attain high performance and reliability. It is also desirable to develop software environments that facilitate not only access to existing methods, but also enable the rapid prototyping, performance evaluation and incorporation of new algorithms for Text Mining. In this research we focus on algorithms that use Linear Algebra and Matrix Analysis tools as computational kernels. We use the term spectral to highlight the fact that our methods rely on the spectral characteristics of the underlying term-document matrices that encode the texts under study. We consolidate our new and existing algorithms in a software environment, called TMG, that we built on top of MATLAB and Perl. First, we consider the basic text mining tasks, namely clustering, ad-hoc retrieval and text classication. In clustering, we focus on a well-known spectral method, called PDDP (Principal Direction Divisive Partitioning) and investigate hybrid methods that combine PDDP and standard workhorses such as k-means. In particular, the proposed methods improve the performance of the aforementioned algorithms, regarding the quality of the attained clustering and/or their speed. Compared with k-means, our algorithms eliminate the non-determinism originating from k-means' initialization phase. We also propose a framework for kernel methods, that can be used in case the data exhibit non-linearities. Our spectral clustering algorithms are applied in sparse matrix reordering, specifically in the block diagonalization of row stochastic matrices. In addition to helping in the intepretation of a recent method for identifying metastable states of Markov chains, they also provide the means to improve their performance. Initial results, demonstrate that the proposed methodology can improve significantly over existing techniques, deriving approximations of the number of blocks corresponding to dinstict stable states of the underlying Markov chain. We also show how to use spectral methods to improve the performance of a density-based clustering approach, called Oriented k-windows. In particular, the algorithm uses information derived from the Principal Component Analysis (PCA), in order to guide a windowing technique, namely k-windows, that could give insights about the data orientation. The next part of the thesis deals with ad-hoc retrieval and classification methods, based on Latent Semantic Indexing (LSI). We propose an algorithmic framework based on a "representatives methodology", in order to approximate a collection semantically, by extracting subspaces of the column space of the term-document matrix, that approximate the optimal subspace derived by the SVD. Our methodology uses clustering techniques, like the aforementioned hybrid methods, in a preprocessing stage. Our objective is to split the problem into a set of independent subproblems that could be solved more efficiently. Results from extensive experimentation indicate that our methodology can improve a state-of-the-art method like LSI. We also apply the representatives methodology to kernel methods and Nonnegative Matrix Factorization (NMF). Extensive numerical experiments indicate that this methodology improves the computational cost and memory requirements of kernel methods and also increases the quality of the nonnegative approximations. We have incorporated all the proposed methods in a software environment, called Text to Matrix Generator (TMG). The first release of TMG was before this Ph.D. was even started. but has since undergone several upgrades and rewrites. TMG currently consists of six easily extensible modules. These modules provide methods for (i) constructing and updating term-document matrices, (ii) computing low rank approximations and (iii) non negative factorizations, and (iv) ad-hoc retrieval, (v) clustering and (vi) classification. TMG is accessible in two primary modes, graphical and command line and is freely downloadable from its webpage (http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/). As our usage logs indicate, TMG is being used worldwide for research and educational uses. We also describe a brief overview of open problems and ongoing work. We describe our first version of "remote TMG", that views TMG as a Web resource and provides remote access mode to it by means of a special API. Ομαδοποίηση Κατηγοριοποίηση Προσέγγιση μητρώων 005.74 Vector space models Clustering Classification PDDP Matrix approximation Text to matrix generators CLSI Oriented k-windows
4	Compositional distributional semantics with compact closed categories and Frobenius algebras Kartsaklis, Dimitrios January 2014 (has links) The provision of compositionality in distributional models of meaning, where a word is represented as a vector of co-occurrence counts with every other word in the vocabulary, offers a solution to the fact that no text corpus, regardless of its size, is capable of providing reliable co-occurrence statistics for anything but very short text constituents. The purpose of a compositional distributional model is to provide a function that composes the vectors for the words within a sentence, in order to create a vectorial representation that re ects its meaning. Using the abstract mathematical framework of category theory, Coecke, Sadrzadeh and Clark showed that this function can directly depend on the grammatical structure of the sentence, providing an elegant mathematical counterpart of the formal semantics view. The framework is general and compositional but stays abstract to a large extent. This thesis contributes to ongoing research related to the above categorical model in three ways: Firstly, I propose a concrete instantiation of the abstract framework based on Frobenius algebras (joint work with Sadrzadeh). The theory improves shortcomings of previous proposals, extends the coverage of the language, and is supported by experimental work that improves existing results. The proposed framework describes a new class of compositional models thatfind intuitive interpretations for a number of linguistic phenomena. Secondly, I propose and evaluate in practice a new compositional methodology which explicitly deals with the different levels of lexical ambiguity (joint work with Pulman). A concrete algorithm is presented, based on the separation of vector disambiguation from composition in an explicit prior step. Extensive experimental work shows that the proposed methodology indeed results in more accurate composite representations for the framework of Coecke et al. in particular and every other class of compositional models in general. As a last contribution, I formalize the explicit treatment of lexical ambiguity in the context of the categorical framework by resorting to categorical quantum mechanics (joint work with Coecke). In the proposed extension, the concept of a distributional vector is replaced with that of a density matrix, which compactly represents a probability distribution over the potential different meanings of the specific word. Composition takes the form of quantum measurements, leading to interesting analogies between quantum physics and linguistics. 512
5	Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA / with Applications for QuantNet 2.0 and GitHub Borke, Lukas 08 September 2017 (has links) Mit der wachsenden Popularität von GitHub, dem größten Online-Anbieter von Programm-Quellcode und der größten Kollaborationsplattform der Welt, hat es sich zu einer Big-Data-Ressource entfaltet, die eine Vielfalt von Open-Source-Repositorien (OSR) anbietet. Gegenwärtig gibt es auf GitHub mehr als eine Million Organisationen, darunter solche wie Google, Facebook, Twitter, Yahoo, CRAN, RStudio, D3, Plotly und viele mehr. GitHub verfügt über eine umfassende REST API, die es Forschern ermöglicht, wertvolle Informationen über die Entwicklungszyklen von Software und Forschung abzurufen. Unsere Arbeit verfolgt zwei Hauptziele: (I) ein automatisches OSR-Kategorisierungssystem für Data Science Teams und Softwareentwickler zu ermöglichen, das Entdeckbarkeit, Technologietransfer und Koexistenz fördert. (II) Visuelle Daten-Exploration und thematisch strukturierte Navigation innerhalb von GitHub-Organisationen für reproduzierbare Kooperationsforschung und Web-Applikationen zu etablieren. Um Mehrwert aus Big Data zu generieren, ist die Speicherung und Verarbeitung der Datensemantik und Metadaten essenziell. Ferner ist die Wahl eines geeigneten Text Mining (TM) Modells von Bedeutung. Die dynamische Kalibrierung der Metadaten-Konfigurationen, TM Modelle (VSM, GVSM, LSA), Clustering-Methoden und Clustering-Qualitätsindizes wird als "Smart Clusterization" abgekürzt. Data-Driven Documents (D3) und Three.js (3D) sind JavaScript-Bibliotheken, um dynamische, interaktive Datenvisualisierung zu erzeugen. Beide Techniken erlauben Visuelles Data Mining (VDM) in Webbrowsern, und werden als D3-3D abgekürzt. Latent Semantic Analysis (LSA) misst semantische Information durch Kontingenzanalyse des Textkorpus. Ihre Eigenschaften und Anwendbarkeit für Big-Data-Analytik werden demonstriert. "Smart clusterization", kombiniert mit den dynamischen VDM-Möglichkeiten von D3-3D, wird unter dem Begriff "Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA" zusammengefasst. / With the growing popularity of GitHub, the largest host of source code and collaboration platform in the world, it has evolved to a Big Data resource offering a variety of Open Source repositories (OSR). At present, there are more than one million organizations on GitHub, among them Google, Facebook, Twitter, Yahoo, CRAN, RStudio, D3, Plotly and many more. GitHub provides an extensive REST API, which enables scientists to retrieve valuable information about the software and research development life cycles. Our research pursues two main objectives: (I) provide an automatic OSR categorization system for data science teams and software developers promoting discoverability, technology transfer and coexistence; (II) establish visual data exploration and topic driven navigation of GitHub organizations for collaborative reproducible research and web deployment. To transform Big Data into value, in other words into Smart Data, storing and processing of the data semantics and metadata is essential. Further, the choice of an adequate text mining (TM) model is important. The dynamic calibration of metadata configurations, TM models (VSM, GVSM, LSA), clustering methods and clustering quality indices will be shortened as "smart clusterization". Data-Driven Documents (D3) and Three.js (3D) are JavaScript libraries for producing dynamic, interactive data visualizations, featuring hardware acceleration for rendering complex 2D or 3D computer animations of large data sets. Both techniques enable visual data mining (VDM) in web browsers, and will be abbreviated as D3-3D. Latent Semantic Analysis (LSA) measures semantic information through co-occurrence analysis in the text corpus. Its properties and applicability for Big Data analytics will be demonstrated. "Smart clusterization" combined with the dynamic VDM capabilities of D3-3D will be summarized under the term "Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA". GitHub Mining Infrastruktur Software Mining Text Mining Verallgemeinerte Vektorraum-Modelle YAML Visuelles Data Mining Visual Analytics D3.js Clusteranalyse Qualitätsindizes Validierungs-Pipeline Verteilt-paralleles Rechnen Reproduzierbare Kooperationsforschung Risk Analytics GitHub Mining infrastructure Software Mining Text Mining Generalized Vector Space Models YAML Visual Data Mining Visual Analytics D3.js Cluster Analysis Quality Indices Validation Pipeline Cluster and Parallel Computing Collaborative Reproducible Research Risk Analytics 330 Wirtschaft QH 250 ddc:330

1

Page generated in 0.0857 seconds