Software Architecture Recovery based on Pattern Matching
Sartipi, Kamran. January 2003
Pattern matching approaches in reverse engineering aim to incorporate domain knowledge and system documentation into the software architecture extraction process, and thereby provide a collaborative user/tool environment for architectural design recovery. This thesis presents a model and an environment for recovering the high-level design of legacy software systems based on user-defined architectural patterns and graph matching techniques.
In the proposed model, a high-level view of a software system, in terms of the system components and their interactions, is expressed as a query in a description language. A query is mapped onto a pattern-graph, where a module and its interactions with other modules are represented as a group of graph-nodes and a group of graph-edges, respectively. Interaction constraints can be modeled in the description language as part of the query. The pattern-graph is matched against an entity-relation graph that represents the information extracted from the source code of the software system. An approximate graph matching process performs a series of graph edit operations (i.e., node/edge insertion/deletion) on the pattern-graph and uses a ranking mechanism based on data mining association to obtain a sub-optimal solution. The obtained solution corresponds to an extracted architecture that complies with the given query.
An interactive prototype toolkit implemented as part of this thesis provides an environment for architecture recovery at two levels. First, the system is decomposed into a number of subsystems of files. Second, each subsystem can be decomposed into a number of modules of functions, datatypes, and variables.
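The matching step described in this abstract can be illustrated with a deliberately simplified sketch. It assumes that the entity-relation graph is a list of directed edges between entities, that the query only states which module may use which other module, and that every edge the query does not permit counts as one edit. The names `edit_cost`, `best_assignment`, and the toy data are hypothetical, and the brute-force search merely stands in for the ranked, sub-optimal search of the thesis.

```python
from itertools import product


def edit_cost(assignment, er_edges, allowed):
    """Count entity-relation edges that cross modules in a way the query forbids."""
    return sum(
        1
        for src, dst in er_edges
        if assignment[src] != assignment[dst]
        and (assignment[src], assignment[dst]) not in allowed
    )


def best_assignment(entities, modules, er_edges, allowed):
    """Try every entity-to-module assignment and keep the cheapest one."""
    best = None
    for combo in product(modules, repeat=len(entities)):
        if set(combo) != set(modules):
            continue  # every module in the query must receive at least one entity
        assignment = dict(zip(entities, combo))
        cost = edit_cost(assignment, er_edges, allowed)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical toy system: four functions and their call edges.
entities = ["f1", "f2", "g1", "g2"]
er_edges = [("f1", "f2"), ("f2", "g1"), ("g1", "g2")]
allowed = {("A", "B")}  # the query: module A may use module B, not vice versa

cost, assignment = best_assignment(entities, ["A", "B"], er_edges, allowed)
print(cost, assignment)
```

Adding a back edge such as `("g2", "f1")` makes a zero-cost decomposition impossible, because the call cycle must cross from B to A at least once; the search then returns a decomposition with one offending edge.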
Secretarial support for university schools: Development of a website using Semantic Web technologies
Φωτεινός, Γεώργιος. 30 April 2014
A subset of the vast amount of information on the Web concerns Open Data: information, public or otherwise, that anyone can access and reuse for any purpose in order to add value to it. The potential of open data becomes apparent when the datasets of public bodies are turned into truly open data, that is, data without legal, financial, or technological restrictions on further use by third parties.
The open data of a university department or school can create added value and have a positive impact in many different areas, such as participation, innovation, improved efficiency and effectiveness of university services, and the generation of new knowledge by combining data. The ultimate goal is to turn open data into Linked Open Data. Linked data acquire a meaning that machines can understand and process, because they are described semantically using ontologies. The data thus become "smarter" and more useful through the structure they acquire. In this thesis, a prototype web portal is implemented using the Drupal content management system (CMS), which incorporates Semantic Web technologies in its core, in order to convert the data of a university department or school into Linked Open Data available on the third generation of the Web, the Semantic Web.
Exploiting open-source technologies for the development of Semantic Web applications
Κασσέ, Παρασκευή. 14 February 2012
Over the past few years, the volume of information published on the Internet has grown exponentially. Because this information is not connected to its semantics, it is difficult to manage and to access. The Semantic Web is a set of methods and technologies that aim to enable machines to understand the semantics of information on the World Wide Web.
The Semantic Web is an extension of the World Wide Web. In the Semantic Web, information is enriched with metadata that conform to common standards and allow new knowledge to be extracted from existing knowledge, as well as existing information to be combined in order to draw inferences. The ultimate goals of the Semantic Web are improved search, the execution of complex tasks, and the personalization of information according to each user's needs.
This postgraduate diploma thesis studies the use of Semantic Web technologies to improve access to cultural data. First, the technologies and fundamental concepts of the Semantic Web are examined in depth. The basic markup languages are presented in detail: XML, which allows the creation of structured documents with a user-defined vocabulary, and RDF, which offers a data model for describing information in such a way that machines can read and understand it. The various RDF syntaxes are also covered, as well as how RDF graphs are searched using SPARQL. A description of RDFS, a language for describing RDF vocabularies, follows. With the concept of an ontology introduced in an earlier chapter, the thesis then presents the semantic markup language OWL, which is used for publishing and distributing ontologies on the Internet. Next comes a review of selected Greek, European, and international projects of recent years that apply Semantic Web technologies in the field of culture and cultural heritage. Finally, the seventh chapter presents an application for managing archaeological sites and monuments and studies in depth the technologies and tools used to implement it.
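The RDF data model and graph search mentioned in this abstract can be sketched without any library. The snippet below is a minimal illustration, assuming triples are plain tuples of strings and variables start with `?`; the `ex:` resources and the helper names `match` and `query` are hypothetical and do not come from the thesis. It mimics roughly what a SPARQL basic graph pattern such as `?m rdf:type ex:Monument . ?m ex:locatedIn ex:Athens` expresses, not the SPARQL language itself.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def query(triples, patterns):
    """Return all variable bindings that satisfy every pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match(pattern, t, b)) is not None
        ]
    return bindings


# Hypothetical cultural-heritage data as (subject, predicate, object) triples.
triples = [
    ("ex:Parthenon", "rdf:type", "ex:Monument"),
    ("ex:Parthenon", "ex:locatedIn", "ex:Athens"),
    ("ex:Knossos", "rdf:type", "ex:Monument"),
    ("ex:Knossos", "ex:locatedIn", "ex:Crete"),
]

monuments_in_athens = query(
    triples,
    [("?m", "rdf:type", "ex:Monument"), ("?m", "ex:locatedIn", "ex:Athens")],
)
print(monuments_in_athens)  # [{'?m': 'ex:Parthenon'}]
```

Each pattern narrows the set of candidate bindings, which is why the shared variable `?m` acts as a join between the two triples.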
Towards a generic and open software environment for the development of secure NFC applications
Lesas, Anne-Marie. 14 September 2017
In the field of electronic transactions and payment with a smart card, the Near Field Communication (NFC) standard has established itself as the technology for secure mobile contactless transactions for payment, access control, and authentication. Secure mobile contactless services are based on the card emulation mode of the NFC standard, which involves a restricted-access, smart-card-type hardware component called the "Secure Element" (SE), in which confidential data and sensitive functions are securely stored and run.
Despite standardization efforts around the ecosystem, the models proposed for implementing the SE are complex and lack genericity, whether for offering abstraction mechanisms, for developing high-level applications, or for implementing and verifying the security constraints of applications. The objective of the thesis is to design and build a software environment based on a generic model that complies with established standards and is largely insensitive to technological change. This environment should enable non-experts to develop multi-platform, multi-mode, multi-form-factor applications that interface with the SE in an NFC smartphone.
Role-based Data Management
Jäkel, Tobias. 24 March 2017
Database systems form an integral component of today's software systems. As such, they are the central point for storing and sharing a software system's data while ensuring global data consistency. Introducing the primitives of roles and their accompanying metatype distinction in modeling and programming languages results in a novel paradigm for designing, extending, and programming modern software systems. In detail, roles as a modeling concept enable a separation of concerns within an entity. Along with its rigid core, an entity may acquire various roles in different contexts during its lifetime and thus adapts its behavior and structure dynamically at runtime.
Unfortunately, database systems, as an important component and the global consistency provider of such systems, do not keep pace with this trend. The absence of a metatype distinction in the database system, in terms of an entity's separation of concerns, results in various problems for the software system in general, for the application developers, and finally for the database system itself. In the case of relational database systems, these problems are summarized under the term role-relational impedance mismatch. In particular, the whole software system is designed using different semantics on various layers. When role-based software systems are combined with relational database systems, this gap in semantics between the applications and the database system widens dramatically. Consequently, the database system cannot directly represent the richer semantics of roles or the accompanying consistency constraints. These constraints have to be ensured by the applications, and the database system loses its characteristic as the single point of truth in the software system. As the applications are in charge of guaranteeing global consistency, their development requires more effort in data management. Moreover, the software system's data management is distributed over several layers, which results in an unstructured software system architecture.
To overcome the role-relational impedance mismatch and bring the database system back to its rightful position as the single point of truth in a software system, this thesis introduces the novel, tripartite RSQL approach. It combines a novel database model that represents the metatype distinction as a first-class citizen in a database system, an adapted query language based on that database model, and a proper result representation. Precisely, RSQL's logical database model introduces Dynamic Data Types to directly represent the separation of concerns within an entity type on the schema level. On the instance level, the database model defines the notion of a Dynamic Tuple, which combines an entity with the notion of roles and thus allows for dynamic structure adaptations at runtime without changing an entity's overall type.
These definitions form the main data structures on which the database system operates. Moreover, formal operators that connect the query language statements with the database model's data structures complete the database model. The query language, as the external database system interface, features an individual data definition, data manipulation, and data query language. Their statements directly represent the metatype distinction in order to address Dynamic Data Types and Dynamic Tuples, respectively. As a consequence of the novel data structures, the query processing of Dynamic Tuples is completely redesigned. As the last piece of a complete database integration of a role-based notion and its accompanying metatype distinction, we specify the RSQL Result Net as the result representation. It provides a novel result structure and features functionality to navigate through query results. Finally, we evaluate all three RSQL components in comparison to a relational database system. This assessment clearly demonstrates the benefits of fully integrating the role concept into the database.
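The idea of an entity whose core stays fixed while its roles change at runtime can be sketched as follows. This is only an illustration under simplifying assumptions: the class `DynamicTuple` below, its fields, and the helper `playing` are hypothetical and are not RSQL's formal definitions or query syntax.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicTuple:
    """An entity core plus the roles it currently plays (illustrative only)."""

    core: dict
    roles: dict = field(default_factory=dict)  # role name -> role attributes

    def start_playing(self, role, **attributes):
        """Acquire a role together with its role-specific attributes."""
        self.roles[role] = attributes

    def stop_playing(self, role):
        """Drop a role; the core of the entity is left untouched."""
        self.roles.pop(role, None)

    def plays(self, role):
        return role in self.roles


def playing(tuples, role):
    """Select all entities that currently play the given role."""
    return [t for t in tuples if t.plays(role)]


# Hypothetical data: the same person acquires and drops roles over time.
alice = DynamicTuple(core={"name": "Alice"})
bob = DynamicTuple(core={"name": "Bob"})
alice.start_playing("Student", university="Example University")
alice.start_playing("Employee", salary=3000)
bob.start_playing("Student", university="Example University")
alice.stop_playing("Student")

students = [t.core["name"] for t in playing([alice, bob], "Student")]
print(students)  # ['Bob']
```

Keeping roles in a separate mapping means the structure of an entity can change without replacing the entity itself, which is the property the abstract attributes to Dynamic Tuples.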
Improvements of the syntax of the query language DQL
Diep, Mikael; Cheimonettos, Anestis. January 2023
This thesis focuses on improving the syntax of a query language named DQL (Dynamic Query Language) in order to enhance the experience and productivity of its users. The study investigates the original state of the query language and identifies areas for improvement in terms of intuitiveness, efficiency, and consistency. Through an extensive review of existing literature and case studies, the thesis develops a set of guidelines for designing intuitive query languages that minimise the cognitive load for users. The thesis also proposes several modifications to the syntax of DQL that aim to simplify the structure and improve the readability of queries. Finally, the thesis evaluates the effectiveness of the proposed modifications through semi-structured interviews that compare the original syntax with the proposed new one.
Accessing cross-domain information spaces with multi-models
Fuchs, Sebastian. 23 October 2015
With the transition from structure-oriented to process-oriented ways of working, the provision of cross-domain information is gaining importance, for example in the creation of controlling figures, the preparation of simulations, or the consideration of new aspects such as energy efficiency. However, current data formats and access methods cannot meet this challenge satisfactorily. A method is therefore required that fully enables interdisciplinary construction information processes, while allowing existing communication processes and domain applications to be retained and reused.
The multi-model method is presented as an approach to the structural problems of such interdisciplinary construction information processes. Multi-models bundle heterogeneous domain models from different domains and allow their elements to be connected in external, ID-based link models. Because the domain models remain untouched, a loose and temporary coupling becomes possible. Since no leading or integrating data schema is used, no transformation processes are required, established and commonly used data formats can be retained, and the linked domain models can be exchanged neutrally.
The data linked in multi-models offer added informational value over stand-alone domain models. Related information can be evaluated automatically via the persistent links instead of having to be manually and transiently reassigned by people again and again. To a user, a multi-model thus appears as a single self-contained information space.
To create and filter such information spaces across data models, data formats, and domains conveniently, the declarative multi-model query language MMQL is introduced. It allows generic access to the original data and captures the core concepts of multi-model access: n-ary link generation and structural link semantics. An associated interpreter determines the solution path for concrete statements and executes it on real data.
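The combination of untouched domain models and an external, ID-based link model can be sketched in a few lines. The snippet assumes each domain model is a dictionary keyed by element ID and each link is an n-ary record naming one element per model; the model names, IDs, values, and the functions `resolve` and `payments_by_date` are hypothetical and do not reproduce MMQL or the M2A2 data structures.

```python
# Three independent domain models, each keyed by its own element IDs.
models = {
    "building": {
        "W1": {"type": "wall", "area_m2": 20.0},
        "W2": {"type": "wall", "area_m2": 10.0},
        "S1": {"type": "slab", "area_m2": 5.0},
    },
    "cost": {
        "C1": {"unit_price": 50.0},
        "C2": {"unit_price": 80.0},
        "C3": {"unit_price": 100.0},
    },
    "schedule": {
        "T1": {"finish": "2015-03-01"},
        "T2": {"finish": "2015-04-01"},
    },
}

# The link model lives outside the domain models and refers to them by ID only.
links = [
    {"building": "W1", "cost": "C1", "schedule": "T1"},
    {"building": "W2", "cost": "C2", "schedule": "T1"},
    {"building": "S1", "cost": "C3", "schedule": "T2"},
]


def resolve(models, link):
    """Look up the linked element of every domain model named in the link."""
    return {name: models[name][element_id] for name, element_id in link.items()}


def payments_by_date(models, links):
    """Sum area * unit price per finish date by following the links."""
    totals = {}
    for link in links:
        elements = resolve(models, link)
        amount = elements["building"]["area_m2"] * elements["cost"]["unit_price"]
        date = elements["schedule"]["finish"]
        totals[date] = totals.get(date, 0.0) + amount
    return totals


plan = payments_by_date(models, links)
print(plan)  # {'2015-03-01': 1800.0, '2015-04-01': 500.0}
```

Because the links are the only place where the three models meet, any one of them can be exchanged or dropped without transforming the others.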
The implementation and provision of the concepts as IT components at various levels, from the data structure through libraries and services to the stand-alone, universal multi-model software M2A2, allows the multi-model method to be applied immediately and directly in practice.
1. Introduction
1.1. Motivation
1.2. Starting point
1.3. Objectives
1.4. Approach
1.5. Structure of the thesis
2. Information spaces in construction
2.1. Fundamentals of data models
2.2. Construction domain models
2.3. Cross-domain construction information spaces
2.4. Summary
3. The multi-model concept
3.1. The multi-model paradigm
3.2. Multi-model-based way of working
3.3. Principle and structure of multi-models
3.4. Applicable domain models
3.5. Multi-model specialization
3.6. Multi-model operations
3.7. Summary
4. The multi-model query language MMQL
4.1. Conception
4.2. Access to original data
4.3. Multi-model filtering
4.4. Link manipulation
4.5. Summary
5. Interpretation of MMQL statements
5.1. Fundamentals of executing the language
5.2. Determining multi-model views
5.3. Creating links
5.4. Deleting links
5.5. Discussion and summary
6. Implementation and application
6.1. Universal multi-model software M2A2
6.2. Multi-model specialization for construction project management
6.3. Multi-model-based determination of payment schedules
6.4. Evaluation and summary
7. Conclusion
7.1. Summary
7.2. Discussion of results
7.3. Outlook
A. Data models and specifications
A.1. The generic multi-model
A.2. Document container domain model
A.3. MMQL: formal language description
B. Elementary model vocabularies
B.1. Domain
B.2. Phase
B.3. Level of Detail
B.4. Status
C. Implementation details
C.1. List of construction domain models implemented in M2A2
C.2. Implemented extensions of the M2A2 platform
C.3. XML schema of the Mefisto multi-model container
Bibliography
Bridging Language & Data: Optimizing Text-to-SQL Generation in Large Language Models
Wretblad, Niklas; Gordh Riseby, Fredrik. January 2024
Text-to-SQL, which involves translating natural language into Structured Query Language (SQL), is crucial for enabling broad access to structured databases without expert knowledge. However, designing models for such tasks is challenging due to numerous factors, including the presence of "noise" such as ambiguous questions and syntactical errors. This thesis provides an in-depth analysis of the distribution and types of noise in the widely used BIRD-Bench benchmark and of the impact of noise on models. While BIRD-Bench was created to model dirty and noisy database values, it was not created to contain noise and errors in the questions and gold queries. A manual evaluation showed that noise in questions and gold queries is highly prevalent in the financial domain of the dataset, and a further analysis of the other domains indicates the presence of noise in other parts as well. The presence of incorrect gold SQL queries, which then generate incorrect gold answers, has a significant impact on the benchmark's reliability. Surprisingly, when models were evaluated on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. The thesis then introduces the concept of classifying noise in natural language questions, aiming to prevent noisy questions from entering text-to-SQL models and to annotate noise in existing datasets. Experiments using GPT-3.5 and GPT-4 on a manually annotated dataset demonstrated the viability of this approach, with classifiers achieving up to 0.81 recall and 80% accuracy. Additionally, the thesis explored the use of LLMs for automatically correcting faulty SQL queries. This showed a 100% success rate for specific query corrections, highlighting the potential of LLMs for improving dataset quality. We conclude that informative noise labels and reliable benchmarks are crucial to developing new text-to-SQL methods that can handle varying types of noise.
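Why an incorrect gold query undermines a benchmark can be shown with a small sketch. It assumes, as an illustration only, that a prediction is scored by executing it and comparing its result rows with those of the gold query; the table, the question, the queries, and the helpers `run` and `execution_match` are hypothetical and are not taken from BIRD-Bench.

```python
import sqlite3


def run(conn, sql):
    """Execute a query and return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """A prediction counts as correct if it returns the same rows as the gold query."""
    gold = run(conn, gold_sql)
    return gold is not None and run(conn, predicted_sql) == gold


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.0), (3, 40.0)]
)

# Question: "How many accounts have a balance above 50?"
predicted = "SELECT COUNT(*) FROM accounts WHERE balance > 50"
faulty_gold = "SELECT COUNT(*) FROM accounts WHERE balance >= 40"  # annotation error
corrected_gold = "SELECT COUNT(*) FROM accounts WHERE balance > 50"

scored_against_faulty = execution_match(conn, predicted, faulty_gold)
scored_against_corrected = execution_match(conn, predicted, corrected_gold)
print(scored_against_faulty, scored_against_corrected)  # False True
```

The same correct prediction is marked wrong under the faulty gold query and right under the corrected one, which is the kind of distortion the abstract attributes to noisy gold queries.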
Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates
Idris, Muhammad. 05 March 2019
Responsive analytics are rapidly taking over from traditional data analytics, which is dominated by the post-fact approaches of traditional data warehousing. Recent advances in analytics demand placing analytical engines at the forefront of the system so that they can react to updates occurring at high speed and detect patterns, trends, and anomalies. Such solutions find applications in financial systems, industrial control systems, business intelligence, and online machine learning, among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, and the main task is then to maintain the query results efficiently under frequent updates. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis, where the data is refreshed periodically and in batches, while stream processing solutions process streams of data from transient sources as flows of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems.
In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental evaluation, which is based on the relational incremental view maintenance model and mostly focuses on queries that feature equi-joins. Streaming systems, in contrast, generally follow automaton-based models to evaluate queries under updates, and they mostly process queries that feature comparisons of temporal attributes (e.g., timestamp attributes) along with comparisons of non-temporal attributes over streams of bounded size. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can be either equality or inequality constraints. Hence, these systems mostly process inequality joins. As a starting point for our research, we postulate the thesis that queries in streaming systems can also be evaluated efficiently based on the paradigm of incremental evaluation, just as in BI systems, in a main-memory model. The efficiency of such a model is measured in terms of runtime memory footprint and update processing cost. The existing approaches to dynamic evaluation in both kinds of systems present a trade-off between memory footprint and update processing cost. More specifically, systems that avoid materialization of query (sub)results incur high update latency, and systems that materialize (sub)results incur a high memory footprint. We are interested in building a model that can address this trade-off. In particular, we overcome this trade-off by investigating a practical dynamic evaluation algorithm for queries that appear in both kinds of systems, and we present a main-memory data representation that allows query (sub)results to be enumerated without materialization and that can be maintained efficiently under updates. We call this representation the Dynamic Constant Delay Linear Representation (DCLR).
We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded delay (and with constant delay for a sub-class of queries), 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only), 3) they take space linear in the size of the database, 4) they can be maintained efficiently under updates.
We first study DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections and present a dynamic evaluation algorithm called the Dynamic Yannakakis (DYN) algorithm. Then, we generalize the DYN algorithm to the class of acyclic queries featuring multi-way theta-joins with projections and call the result Generalized DYN (GDYN). We devise DCLRs with the above properties for acyclic conjunctive queries; DYN and GDYN operate over DCLRs using a particular variant of join trees, called Generalized Join Trees (GJTs), that guarantees the above-described properties of DCLRs. We define GJTs and present algorithms to test a conjunctive query featuring theta-joins for acyclicity and to generate GJTs for such queries. We extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to testing a conjunctive query featuring multi-way theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic.
GDYN is hence a unified framework based on DCLRs that enables processing of queries that appear in streaming systems as well as in BI systems in a unified main-memory model and addresses the space-time trade-off. We instantiate GDYN to the particular case where all theta-joins involve only equalities and inequalities and call this instantiation IEDYN. We implement DYN and IEDYN as query compilers that generate executable programs in the Scala programming language and provide all the necessary data structures and their maintenance and enumeration methods in a continuous stream processing model. We evaluate DYN and IEDYN against state-of-the-art BI and streaming systems on both industrial and synthetically generated benchmarks. We show that DYN and IEDYN outperform the existing systems by over an order of magnitude in both memory footprint and update processing time.
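The classical GYO reduction that the thesis takes as its starting point can be sketched as follows. The snippet covers only the textbook case of conjunctive queries with equi-joins, where each atom is viewed as the set of its variables; it does not implement the extension to theta-joins or the generation of GJTs, and the function name `is_acyclic` and the example queries are illustrative choices.

```python
def is_acyclic(atoms):
    """Classical GYO reduction: a query is acyclic iff its hypergraph reduces away."""
    edges = [set(atom) for atom in atoms]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove variables that occur in exactly one hyperedge.
        for edge in edges:
            isolated = {v for v in edge if sum(v in other for other in edges) == 1}
            if isolated:
                edge -= isolated
                changed = True
        # Rule 2: remove one hyperedge that is contained in another hyperedge.
        for i, edge in enumerate(edges):
            if any(i != j and edge <= other for j, other in enumerate(edges)):
                del edges[i]
                changed = True
                break
    # Acyclic queries reduce to at most one (empty) hyperedge.
    return len(edges) <= 1


# R(a, b), S(b, c), T(c, d): a path query, which is acyclic.
path_query = [("a", "b"), ("b", "c"), ("c", "d")]
# R(a, b), S(b, c), T(c, a): the triangle query, which is cyclic.
triangle_query = [("a", "b"), ("b", "c"), ("c", "a")]

print(is_acyclic(path_query), is_acyclic(triangle_query))  # True False
```

In the path query every atom eventually becomes an "ear" that folds into a neighbour, whereas in the triangle no variable is private to one atom and no atom is contained in another, so the reduction stalls with three hyperedges left.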
Doctorate in Engineering Sciences and Technology
