Global ETD Search

11	Derby/S: A DBMS for Sample-Based Query Answering Klein, Anja, Gemulla, Rainer, Rösch, Philipp, Lehner, Wolfgang 10 November 2022 Although approximate query processing is a prominent way to cope with the requirements of data analysis applications, current database systems do not provide integrated and comprehensive support for these techniques. To improve this situation, we propose an SQL extension, called SQL/S, for approximate query answering using random samples, and present a prototypical implementation, called Derby/S, within the engine of the open-source database system Derby. Our approach significantly reduces the required expert knowledge by enabling the definition of samples in a declarative way; the choice of the specific sampling scheme and its parametrization is left to the system. SQL/S introduces new DDL commands to easily define and administrate random samples subject to a given set of optimization criteria. Derby/S automatically takes care of sample maintenance if the underlying dataset changes. Finally, samples are transparently used during query processing, and error bounds are provided. Our extensions do not affect traditional queries and provide the means to integrate sampling as a first-class citizen into a DBMS.
12	DataCalc: Ad-hoc Analyses on Heterogeneous Data Sources Luong, Johannes, Habich, Dirk, Lehner, Wolfgang 19 July 2023 Storing and processing data at different locations using a heterogeneous set of formats and data management systems is the state of the art in many organizations. However, data analyses can often provide better insight when data from several sources is integrated into a combined perspective. In this paper, we present an overview of our data integration system DataCalc. DataCalc is an extensible integration platform that executes ad-hoc analytical queries on a set of heterogeneous data processors. Our novel platform uses an expressive function shipping interface that promotes local computation and reduces data movement between processors. We discuss the overall architecture and the main components of DataCalc. Moreover, we discuss the cost of integrating additional processors and evaluate the overall performance of the platform.
13	A Technical Perspective of DataCalc: Ad-hoc Analyses on Heterogeneous Data Sources Luong, Johannes, Habich, Dirk, Lehner, Wolfgang 19 July 2023 Many organizations store and process data at different locations using a heterogeneous set of formats and data management systems. However, data analyses can often provide better insight when data from several sources is integrated into a combined perspective. DataCalc is an extensible data integration platform that executes ad-hoc analytical queries on a set of heterogeneous data processors. The platform uses an expressive function shipping interface that promotes local computation and reduces data movement between processors. In this paper, we provide a detailed discussion of the architecture and implementation of DataCalc. We introduce data processors for plain files, JDBC, the MongoDB document store, and a custom in-memory system. Finally, we discuss the cost of integrating additional processors and evaluate the overall performance of the platform. Our main contribution is the specification and evaluation of the DataCalc code delegation interface.
14	Intelligent Data Layer: An approach to generating data layer from normalized database model. Buzo, Amir January 2012 The Model View Controller (MVC) software architecture is widespread and commonly used in application development. Generating the data layer from the database model can therefore reduce cost and time. A review of current Object Relational Mapping (ORM) tools showed that generating tools such as Data Access Object (DAO) and Hibernate exist, but their use causes problems such as inefficiency and slow performance due to the many database connections and the setup time. Most of these tools try to solve specific problems rather than generate a data layer, which is an important component and the bottom layer of database-centred applications. The proposed solution is an engineering approach in which we have designed a tool named Generated Intelligent Data Layer (GIDL). The GIDL tool generates small models that form the main data layer of the system according to the database model. The goal of this tool is to allow software developers to work only with objects, without deep knowledge of SQL. The tool handles transactions and commits, and it constructs filter objects for filtering the database. The GIDL tool reduces the number of connections and also has a cache in which object lists are stored and modified. The tool was compared with Hibernate in the same environment and showed better performance in terms of execution time for the same functions. The GIDL tool is beneficial for software developers because it generates the entire data layer. Object Relational Mapping (ORM) Generated Intelligent Data Layer (GIDL) Relational Database Microsoft SQL Server Object Oriented Design Pattern Model Model View Controller High Query Language Software Engineering Programvaruteknik Computer Sciences Datavetenskap (datalogi)
15	Bridging Language & Data : Optimizing Text-to-SQL Generation in Large Language Models / Från ord till SQL : Optimering av text-till-SQL-generering i stora språkmodeller Wretblad, Niklas, Gordh Riseby, Fredrik January 2024 Text-to-SQL, which involves translating natural language into Structured Query Language (SQL), is crucial for enabling broad access to structured databases without expert knowledge. However, designing models for such tasks is challenging due to numerous factors, including the presence of "noise", such as ambiguous questions and syntactical errors. This thesis provides an in-depth analysis of the distribution and types of noise in the widely used BIRD-Bench benchmark and of the impact of noise on models. While BIRD-Bench was created to model dirty and noisy database values, it was not created to contain noise and errors in the questions and gold queries. A manual evaluation showed that noise in questions and gold queries is highly prevalent in the financial domain of the dataset, and a further analysis of the other domains indicates the presence of noise in other parts as well. The presence of incorrect gold SQL queries, which then generate incorrect gold answers, has a significant impact on the benchmark's reliability. Surprisingly, when models were evaluated on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. The thesis then introduces the concept of classifying noise in natural language questions, aiming to prevent noisy questions from entering text-to-SQL models and to annotate noise in existing datasets. Experiments using GPT-3.5 and GPT-4 on a manually annotated dataset demonstrated the viability of this approach, with classifiers achieving up to 0.81 recall and 80% accuracy. Additionally, the thesis explored the use of LLMs for automatically correcting faulty SQL queries. This showed a 100% success rate for specific query corrections, highlighting the potential of LLMs for improving dataset quality. We conclude that informative noise labels and reliable benchmarks are crucial to developing new Text-to-SQL methods that can handle varying types of noise. Chaining Classification Data Quality Few-Shot Learning Large Language Model Machine Learning Noise Prompt Prompt Engineering SQL Structured Query Language Text-to-SQL Zero-Shot Learning Noise Identification
16	Publikace dat ze sítě meteostanic ve formátu DATEX II / Implementation of Datex II standard for road transport weather stations Partika, Marek January 2016 This master's thesis deals with the implementation of the European standard DATEX II, which specifies the data format for transmitting information in road transport. Road traffic produces a continuous stream of current information. A network of meteorological stations was selected for this work; it will publish the measured data, i.e. the weather conditions affecting road transport. The measured data will be available to consumers in the DATEX II format. The implementation covers the full operation of a meteorological station, from design to the web service that delivers the data to consumers.
