1 |
Parallel Data Mining On Cycle Stealing Networks
Robertson, Calum Stewart, January 2004 (has links)
In a world where electronic databases are used to store ever-increasing quantities of data, it is becoming harder to mine useful information from them. There is therefore a need for a highly scalable parallel architecture capable of handling the ever-increasing complexity of data mining problems. A cycle stealing network is one possible scalable solution to this problem: it allows users to donate their idle cycles, connecting multiple machines via a network to form a virtual supercomputer. This research aims to establish whether cycle stealing networks, specifically the G2 system developed at the Queensland University of Technology, are viable for large-scale data mining problems. The computationally intensive sequence mining, feature selection and functional dependency mining problems are deliberately chosen to test the usefulness and scalability of G2. Tests have shown that G2 is highly scalable where the ratio of computation to communication is approximately known. However, for combinatorial problems where computation times are difficult or impossible to predict and communication costs can be unpredictable, G2 often provides little or no speedup. This research demonstrates that existing sequence mining and functional dependency mining techniques are not suited to a client-server-style cycle stealing network like G2. Feature selection, however, is well suited to G2, and a new sequence mining algorithm offers performance comparable to other existing, non-cycle-stealing, parallel sequence mining algorithms. Furthermore, new functional dependency mining algorithms offer substantial benefits over existing serial algorithms.
|
2 |
TOOL DEVELOPMENT FOR TEST OPTIMIZATION PURPOSES
Cako, Gezim, January 2021 (has links)
Background: Software testing is a crucial part of the system development life cycle; it pays off by detecting flaws and defects, ultimately leading to high-quality products. Software testing is generally performed either manually by a human operator or automatically. As test cases are written and executed, the testing process checks whether all requirements are covered and whether the system exhibits the expected behavior. A large portion of the cost and time of software development is spent on testing; therefore, depending on the type of software, test optimization is needed as a way to save cost and time. Aim: This thesis aims to propose and evaluate the improved sOrTES+ tool for test optimization, comprising selection, prioritization, and scheduling of test cases integrated into a dynamic user interface. Method: In this thesis, test optimization is addressed at two levels: low-level requirements and high-level requirements. Our solution analyzes these requirements to detect dependencies between test cases. We propose sOrTES+, a tool that uses three different scheduling techniques for test optimization: Greedy, Greedy DO (direct output), and Greedy TO (total output). These techniques are integrated into a dynamic user interface that allows testers to manage their projects, view useful information about test cases and requirements, record executed test cases while scheduling the remaining ones for execution, and switch between the scheduling techniques according to the project requirements. Finally, we demonstrated the tool's applicability and compared it with the existing testing techniques used by our industrial partner Alstom, evaluating efficiency in terms of requirement coverage and troubleshooting time. Results: Our comparison shows that our solution improves requirement coverage by 26.4% while decreasing troubleshooting time by 6%. Conclusion: Based on our results, we conclude that our proposed tool, sOrTES+, can be used for test optimization and that it performs more efficiently than the existing methods used by our industrial partner Alstom.
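To illustrate the greedy, coverage-driven scheduling idea behind tools like sOrTES+, here is a minimal sketch. It is not the tool's implementation: the data model (test cases annotated with the requirements they cover and the test cases they depend on) and the selection rule (always run the executable test case that covers the most uncovered requirements) are assumptions made for illustration.

```python
# Hypothetical sketch of greedy, coverage-driven test scheduling.
# Test cases are assumed to declare which requirements they cover and
# which other test cases must be executed first (dependencies).

test_cases = {
    "TC1": {"covers": {"R1", "R2"}, "depends_on": set()},
    "TC2": {"covers": {"R2", "R3"}, "depends_on": {"TC1"}},
    "TC3": {"covers": {"R4"},       "depends_on": set()},
    "TC4": {"covers": {"R3", "R5"}, "depends_on": {"TC2", "TC3"}},
}

def greedy_schedule(test_cases):
    """Order test cases so that, at each step, the runnable test case
    covering the most not-yet-covered requirements is executed next."""
    executed, covered, schedule = set(), set(), []
    while len(executed) < len(test_cases):
        runnable = [
            name for name, tc in test_cases.items()
            if name not in executed and tc["depends_on"] <= executed
        ]
        if not runnable:          # circular or unsatisfiable dependencies
            break
        # Pick the runnable test case with the largest marginal coverage.
        best = max(runnable, key=lambda n: len(test_cases[n]["covers"] - covered))
        schedule.append(best)
        executed.add(best)
        covered |= test_cases[best]["covers"]
    return schedule, covered

schedule, covered = greedy_schedule(test_cases)
print(schedule)   # e.g. ['TC1', 'TC2', 'TC3', 'TC4']
print(covered)    # {'R1', 'R2', 'R3', 'R4', 'R5'}
```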
|
3 |
Efficient Detection of XML Integrity Constraints
Švirec, Michal, January 2011 (has links)
Title: Efficient Detection of XML Integrity Constraints Author: Michal Švirec Department: Department of Software Engineering Supervisor: RNDr. Irena Mlýnková, Ph.D. Abstract: Knowledge of the integrity constraints covered by XML data is an important aspect of efficient data processing. However, although integrity constraints are defined for the given data, it is a common phenomenon that the data violate the predefined set of constraints. Therefore, approaches for detecting these inconsistencies and subsequently repairing them have emerged. This work extends and refines recent approaches to repairing XML documents that violate a defined set of integrity constraints, specifically so-called functional dependencies. The work proposes a repair algorithm that incorporates a weight model and also involves the user in the detection and subsequent application of appropriate repairs to inconsistent XML documents. Experimental results are part of the work. Keywords: XML, functional dependency, functional dependency violations, violation repair
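To make the notion of an XML functional dependency violation concrete, here is a minimal sketch that checks a dependency of the form target: {LHS paths} -> RHS path over sibling elements. The document, the dependency, and the flat path model are assumptions made for illustration; the thesis works with richer XML functional dependency notions and a weight-based repair model.

```python
# Hypothetical sketch: detecting violations of a simple XML functional
# dependency of the form  target: {LHS paths} -> RHS path, meaning that
# <target> elements agreeing on the LHS values must agree on the RHS value.
import xml.etree.ElementTree as ET
from collections import defaultdict

xml_doc = """
<library>
  <book><title>A</title><publisher>P1</publisher></book>
  <book><title>A</title><publisher>P2</publisher></book>
  <book><title>B</title><publisher>P1</publisher></book>
</library>
"""

def fd_violations(root, target, lhs_paths, rhs_path):
    """Group target elements by their LHS values and report groups whose
    RHS values disagree; each such group is a violation."""
    groups = defaultdict(set)
    for elem in root.findall(target):
        lhs = tuple(elem.findtext(p) for p in lhs_paths)
        groups[lhs].add(elem.findtext(rhs_path))
    return {lhs: rhs for lhs, rhs in groups.items() if len(rhs) > 1}

root = ET.fromstring(xml_doc)
# Dependency assumed for illustration: within <book> elements, title -> publisher.
print(fd_violations(root, "book", ["title"], "publisher"))
# {('A',): {'P1', 'P2'}} -> the two books titled "A" violate the dependency
```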
|
4 |
Improve Data Quality By Using Dependencies And Regular Expressions
Feng, Yuan, January 2018 (has links)
The objective of this study has been to find ways to improve the quality of a database. The data stored in a database suffer from many problems, such as missing values or spelling errors. To deal with such dirty data, this study adopts conditional functional dependencies and regular expressions to detect and correct the data. Building on earlier studies of data cleaning methods, this study considers more complex database conditions and combines efficient algorithms to process the data. The study shows that these methods can improve a database's quality, although, considering time and space complexity, much remains to be done to make the data cleaning process more efficient.
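A minimal sketch of the two detection mechanisms mentioned above, conditional functional dependencies and regular expressions, applied to a small table of records. The table, the conditional dependency, and the phone-number pattern are invented for illustration and are not taken from the thesis.

```python
# Hypothetical sketch: flagging dirty records with (1) a conditional
# functional dependency and (2) a regular expression.
import re
from collections import defaultdict

records = [
    {"id": 1, "cc": "46", "zip": "11120", "city": "Stockholm", "phone": "08-123456"},
    {"id": 2, "cc": "46", "zip": "11120", "city": "Stockhlm",  "phone": "08-654321"},
    {"id": 3, "cc": "44", "zip": "11120", "city": "London",    "phone": "banana"},
]

# Conditional FD: for tuples where cc = "46", zip functionally determines city.
def cfd_violations(rows, condition, lhs, rhs):
    groups = defaultdict(set)
    for r in rows:
        if all(r[a] == v for a, v in condition.items()):
            groups[tuple(r[a] for a in lhs)].add(r[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}

# Regular expression: phone numbers are assumed to look like "08-123456".
phone_pattern = re.compile(r"^\d{2}-\d{6}$")
regex_violations = [r["id"] for r in records if not phone_pattern.match(r["phone"])]

print(cfd_violations(records, {"cc": "46"}, ["zip"], "city"))
# {('11120',): {'Stockholm', 'Stockhlm'}} -> record 2 needs correction
print(regex_violations)   # [3]
```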
|
5 |
Modeling and Querying Uncertainty in Data Cleaning
Beskales, George, January 2012 (has links)
Data quality problems such as duplicate records, missing values, and violations of integrity constraints frequently appear in real-world applications. Such problems cost enterprises billions of dollars annually and might have unpredictable consequences in mission-critical tasks. Data cleaning is the process of detecting and correcting errors in data in order to improve data quality. Numerous efforts have been made towards improving the effectiveness and efficiency of data cleaning.
A major challenge in the data cleaning process is the inherent uncertainty about the cleaning decisions that should be taken by the cleaning algorithms (e.g., deciding whether two records are duplicates or not). Existing data cleaning systems deal with the uncertainty in data cleaning decisions by selecting one alternative, based on some heuristics, while discarding (i.e., destroying) all other alternatives, which results in a false sense of certainty. Furthermore, because of the complex dependencies among cleaning decisions, it is difficult to reverse the process of destroying some alternatives (e.g., when new external information becomes available). In most cases, restarting the data cleaning from scratch is inevitable whenever we need to incorporate new evidence.
To address the uncertainty in the data cleaning process, we propose a new approach, called probabilistic data cleaning, that views data cleaning as a random process whose possible outcomes are possible clean instances (i.e., repairs). Our approach generates multiple possible clean instances to avoid the destructive aspect of current cleaning systems. In this dissertation, we apply this approach in the context of two prominent data cleaning problems: duplicate elimination, and repairing violations of functional dependencies (FDs).
First, we propose a probabilistic cleaning approach for the problem of duplicate elimination. We define a space of possible repairs that can be efficiently generated. To achieve this goal, we concentrate on a family of duplicate detection approaches that are based on parameterized hierarchical clustering algorithms. We propose a novel probabilistic data model that compactly encodes the defined space of possible repairs. We show how to efficiently answer relational queries using the set of possible repairs. We also define new types of queries that reason about the uncertainty in the duplicate elimination process.
Second, in the context of repairing violations of FDs, we propose a novel data cleaning approach that allows sampling from a space of possible repairs. Initially, we contrast the existing definitions of possible repairs, and we propose a new definition of possible repairs that can be sampled efficiently. We present an algorithm that randomly samples from this space, and we present multiple optimizations to improve the performance of the sampling algorithm.
Third, we show how to apply our probabilistic data cleaning approach in scenarios where both data and FDs are unclean (e.g., due to data evolution or an inaccurate understanding of the data semantics). We propose a framework that simultaneously modifies the data and the FDs while satisfying multiple objectives, such as consistency of the resulting data with respect to the resulting FDs, (approximate) minimality of changes to data and FDs, and leveraging the trade-off between trusting the data and trusting the FDs. In the presence of uncertainty about the relative trust in data versus FDs, we show how to extend our cleaning algorithm to efficiently generate multiple possible repairs, each corresponding to a different level of relative trust.
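To make the idea of multiple possible repairs concrete, here is a small illustrative sketch that detects groups of tuples violating an FD and randomly samples one repaired instance per run. The toy repair strategy (overwrite each violating group with one of its observed right-hand-side values) is an assumption for illustration only, not the repair space or sampling algorithm developed in the dissertation.

```python
# Hypothetical sketch: sampling one possible repair of a relation that
# violates the FD  zip -> city. Each run may produce a different repair.
import random
from collections import defaultdict

tuples = [
    {"name": "Ann", "zip": "53703", "city": "Madison"},
    {"name": "Bob", "zip": "53703", "city": "Madson"},
    {"name": "Cat", "zip": "60601", "city": "Chicago"},
]

def sample_fd_repair(rows, lhs, rhs, rng=random):
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    repaired = []
    for group in groups.values():
        # One possible repair: make the whole group agree on a randomly
        # chosen rhs value already present in the group.
        chosen = rng.choice([row[rhs] for row in group])
        for row in group:
            repaired.append({**row, rhs: chosen})
    return repaired

print(sample_fd_repair(tuples, "zip", "city"))
# e.g. both 53703 tuples now share either "Madison" or "Madson"
```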
|
6 |
Optimisation des requêtes skyline multidimensionnelles / Optimization of multidimensional skyline queries
Kamnang Wanko, Patrick, 09 February 2017 (links)
As part of selecting the best items in a multidimensional database, several kinds of queries have been defined. The skyline operator has the advantage of not requiring a scoring function to rank the items. However, because this operator does not satisfy the monotonicity property, (i) its queries are difficult to optimize in a multidimensional context and (ii) the size of a query result is almost impossible to predict. This work first addresses the question of estimating the size of the result of a given skyline query, formulating estimators with good statistical properties (unbiased or convergent). It then provides two different approaches to optimizing multidimensional skyline queries: the first relies on a classical database concept, functional dependencies, while the second is close to data compression techniques. Both techniques hold their own within the state of the art, as the experimental results confirm. Finally, we address skyline queries over dynamic data by adapting one of our earlier solutions to this setting.
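For readers unfamiliar with the operator, a minimal sketch of a skyline computation over a small set of points follows. The quadratic nested scan and the convention that smaller values are preferred on every dimension are illustrative assumptions, not the optimizations proposed in this work.

```python
# Hypothetical sketch of the skyline operator: keep the points that are
# not dominated by any other point, where p dominates q if p is at least
# as good on every dimension and strictly better on at least one
# (here "better" is assumed to mean "smaller").
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hotels as (price, distance_to_beach): cheaper and closer is better.
hotels = [(50, 8), (60, 5), (70, 2), (80, 3), (55, 9)]
print(skyline(hotels))   # [(50, 8), (60, 5), (70, 2)]
```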
|
7 |
An Introduction to Functional Independency in Relational Database Normalization
Chen, Tennyson X.; Liu, Sean Shuangquan; Meyer, Martin D.; Gotterbarn, Don, 17 May 2007 (links)
In this paper, we discuss the deficiencies of normal form definitions based on Functional Dependency and introduce a new normal form concept based on Functional Independency. Functional Independency has not been systematically investigated, whereas there is a very strong theoretical foundation for the study of Functional Dependency in relational database normalization. This paper demonstrates that considering Functional Dependency alone cannot eliminate some common data anomalies, and that the normalization process can yield better database designs with the addition of Functional Independency.
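As background for the dependency-based normal forms discussed above, here is a minimal sketch of checking whether a functional dependency X -> Y holds in a relation instance; the employee table and the dependencies tested are invented for illustration.

```python
# Hypothetical sketch: a functional dependency X -> Y holds in a relation
# instance when no two tuples agree on X but disagree on Y.
def fd_holds(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False          # two tuples agree on lhs but differ on rhs
        seen[key] = val
    return True

employees = [
    {"emp_id": 1, "dept": "HR",  "dept_floor": 2},
    {"emp_id": 2, "dept": "HR",  "dept_floor": 2},
    {"emp_id": 3, "dept": "Eng", "dept_floor": 5},
]

print(fd_holds(employees, ["dept"], ["dept_floor"]))   # True:  dept -> dept_floor
print(fd_holds(employees, ["dept_floor"], ["emp_id"])) # False: floor does not determine employee
```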
|
8 |
Path-functional dependencies and the two-variable guarded fragment with counting
Kourtis, Georgios, January 2017 (links)
We examine how logical reasoning in the two-variable guarded fragment with counting quantifiers can be integrated with databases in the presence of certain integrity constraints, called path-functional dependencies. In more detail, we establish that the problems of satisfiability and finite satisfiability for the two-variable guarded fragment with counting quantifiers, a database, and binary path-functional dependencies are EXPTIME-complete; we also establish that the data complexity of these problems is NP-complete. We establish that query answering for the above fragment (with a database and binary path-functional dependencies) is 2-EXPTIME-complete with respect to arbitrary models, and provide a 2-EXPTIME upper bound for finite models. Finally, we establish that the data complexity of query answering is coNP-complete, both with respect to arbitrary and finite models.
|
9 |
Advancing the discovery of unique column combinations
Abedjan, Ziawasch; Naumann, Felix, January 2011 (links)
Unique column combinations of a relational database table are sets of columns whose value combinations contain no duplicates. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and Apriori-based algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-GORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
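A minimal sketch of the Apriori-style, bottom-up idea referenced above: check single columns for uniqueness, extend only non-unique combinations to larger candidates, and prune supersets of combinations already known to be unique. The table and the simple pruning rule are illustrative assumptions; HCA and HCA-GORDIAN use considerably more sophisticated candidate generation and statistics-based pruning.

```python
# Hypothetical sketch: Apriori-style, bottom-up discovery of minimal unique
# column combinations. A combination is unique if no two rows share the same
# value tuple on those columns; supersets of unique combinations are pruned
# because they cannot be minimal.
rows = [
    {"first": "Ada",  "last": "Lovelace", "city": "London"},
    {"first": "Alan", "last": "Turing",   "city": "London"},
    {"first": "Ada",  "last": "Lovelace", "city": "Paris"},
]

def is_unique(rows, cols):
    projected = {tuple(r[c] for c in cols) for r in rows}
    return len(projected) == len(rows)

def minimal_uniques(rows):
    columns = sorted(rows[0].keys())
    uniques = []
    level = [(c,) for c in columns]          # candidates of size 1
    while level:
        next_level = set()
        for combo in level:
            if any(set(u) <= set(combo) for u in uniques):
                continue                     # pruned: contains a known unique
            if is_unique(rows, combo):
                uniques.append(combo)        # minimal unique found
            else:
                for c in columns:            # extend non-unique combinations
                    if c > combo[-1]:
                        next_level.add(combo + (c,))
        level = sorted(next_level)
    return uniques

print(minimal_uniques(rows))
# [('city', 'first'), ('city', 'last')]  -- no single column is unique here
```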
|