561.
Aggregation and Privacy in Multi-Relational Databases. Jafer, Yasser. January 2012.
Most existing data mining approaches perform their tasks on a single data table. Increasingly, however, data repositories such as financial data and medical records are stored in relational databases. The inability to apply traditional data mining techniques directly to such relational databases poses a serious challenge. To address this issue, a number of researchers convert a relational database into one or more flat files and then apply traditional data mining algorithms. This process of transforming a relational database into one or more flat files usually involves aggregation. Aggregation functions such as maximum, minimum, average, standard deviation, count, and sum are commonly used in such a flattening process.
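As a minimal sketch of this flattening step, the following Python/SQLite example joins a hypothetical one-to-many pair of tables and turns each aggregation function into one derived feature per row of the target table. The table and column names are invented for illustration, and standard deviation is omitted because SQLite has no built-in aggregate for it:

```python
import sqlite3

# Hypothetical schema: one customer row joins many transaction rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (cid INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE txn (tid INTEGER PRIMARY KEY, cid INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'east'), (2, 'west');
    INSERT INTO txn VALUES (10, 1, 50.0), (11, 1, 70.0), (12, 2, 20.0);
""")

# Flatten the one-to-many relation into a single table: each aggregation
# function becomes one derived feature per customer.
rows = con.execute("""
    SELECT c.cid,
           c.region,
           COUNT(t.tid)  AS txn_count,
           SUM(t.amount) AS txn_sum,
           AVG(t.amount) AS txn_avg,
           MIN(t.amount) AS txn_min,
           MAX(t.amount) AS txn_max
    FROM customer c LEFT JOIN txn t ON t.cid = c.cid
    GROUP BY c.cid
""").fetchall()
for r in rows:
    print(r)
```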
Our research aims to address the following question: is there a link between aggregation and possible privacy violations during relational database mining? In this research we investigate how, and whether, applying aggregation functions affects the privacy of a relational database during supervised learning (classification), where the target concept is known. To this end, we introduce the PBIRD (Privacy Breach Investigation in Relational Databases) methodology. The PBIRD methodology combines multi-view learning with feature selection to discover the potentially dangerous sets of features hidden within a database. Our approach creates a number of views, consisting of subsets of the data, with and without aggregation. Then, by identifying and investigating the set of selected features in each view, potential privacy breaches are detected. In this way, our PBIRD algorithm is able to discover the features that are correlated with the classification target and that may also lead to the disclosure of sensitive information in the database.
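The abstract describes PBIRD only at this level of detail, so the sketch below is a schematic illustration of the views-plus-feature-selection idea rather than the authors' implementation. The view contents are synthetic placeholders, and mutual information (from scikit-learn) stands in for whichever feature selection measure PBIRD actually uses:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Two hypothetical views of the same records: raw attributes only, and
# raw attributes plus aggregated features (e.g., txn_sum, txn_avg).
names = {
    "raw":        ["region", "age", "plan"],
    "aggregated": ["region", "age", "plan", "txn_sum", "txn_avg", "txn_count"],
}
views = {k: rng.random((200, len(v))) for k, v in names.items()}
target = rng.integers(0, 2, size=200)  # stand-in for the sensitive class label

# For each view, rank features by mutual information with the target and
# report those above a threshold as potential privacy risks.
for view, X in views.items():
    mi = mutual_info_classif(X, target, random_state=0)
    risky = [n for n, s in zip(names[view], mi) if s > 0.01]
    print(view, "-> potentially revealing features:", risky)
```

Comparing the flagged feature sets across the two views mirrors the abstract's observation that aggregation changes which attributes correlate with the target.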
Our experimental results show that aggregation functions do, indeed, change the correlation between attributes and the classification target. We show that, with aggregation, we obtain a set of features that can be accurately linked to the classification target and used to predict (with high accuracy) the confidential information. Without aggregation, by contrast, we obtain a different set of potentially harmful features. By identifying the complete set of potentially dangerous attributes, the PBIRD methodology makes it possible to warn database designers/owners, so that they can make the adjustments needed to protect the privacy of the relational database.
In our research, we also perform a comparative study of the impact of aggregation on classification accuracy and on the time required to build the models. Our results suggest that when a database consists only of categorical data, aggregation should be used with particular caution, because it decreases the overall accuracy of the resulting models. When the database contains mixed attributes, the results show that accuracies with and without aggregation are comparable, though even in such scenarios schemas without aggregation tend to perform slightly better. With regard to model building time, the results show that models constructed with aggregation generally require less time to build. However, when the database is small and consists of nominal attributes with high cardinality, aggregation slows model building.
562.
The Conceptual Integration Modelling Framework: Semantics and Query Answering. Guseva, Ekaterina. January 2016.
In the context of business intelligence (BI), the accuracy and accessibility of information consolidation play an important role. Integrating data from different sources involves transforming it according to constraints expressed in an appropriate language, and the Conceptual Integration Modelling framework (CIM) acts as such a language. The CIM is intended to allow business users to specify what information is needed in a simplified and comprehensive language. Achieving this requires raising the level of abstraction to the conceptual level, so that users can pose queries in a conceptual query language (CQL).
The CIM comprises an Extended Entity Relationship (EER) model (a high-level conceptual model used to design databases), a conceptual schema against which users pose their queries, a relational multidimensional model that represents the data sources, and mappings between the conceptual schema and the sources. Such mappings can be specified in two ways. In the first scenario, the so-called global-as-view (GAV) approach, the global schema is mapped to views over the relational sources by specifying how to obtain the tuples of a global relation from tuples in the sources. In the second scenario, called local-as-view (LAV), the sources may contain less detailed (more aggregated) information, so the local relations are defined as views over the global relations.
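As a hedged illustration of the two mapping styles (the relation names below are invented, not taken from the thesis), the difference can be written as first-order sentences:

```latex
% GAV: a global relation is defined as a view over the sources.
% Hypothetical global relation Sales populated from sources S_1, S_2:
\forall p, m, a \; \big( S_1(p, m, a) \lor S_2(p, m, a)
    \rightarrow \mathit{Sales}(p, m, a) \big)

% LAV: a source relation is defined as a view over the global schema.
% A source holding only which products sold in which month (less detail):
\forall p, m \; \big( \mathit{SoldIn}(p, m)
    \rightarrow \exists a \; \mathit{Sales}(p, m, a) \big)
```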
In this thesis, we address the problem of the expressibility and decidability of queries written in CQL. We first define the semantics of the CIM by translating the conceptual model into a set of first-order sentences containing a class of conceptual dependencies (CDs): tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs), together with certain (first-order) restrictions that express multidimensionality. Here, multidimensionality means that facts in a data warehouse can be described from different perspectives. The EGDs assert equality between tuples, and the TGDs assert that two instances stand in a subtype association (more precise definitions are given later in the thesis).
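A standard textbook example of each dependency class, over hypothetical relations, shown only to fix the notation:

```latex
% Tuple-generating dependency (TGD): every manager is also an employee,
% i.e. the two instances stand in a subtype association.
\forall x \, \big( \mathit{Manager}(x)
    \rightarrow \exists d \; \mathit{Employee}(x, d) \big)

% Equality-generating dependency (EGD): an employee belongs to a single
% department, so two tuples agreeing on the key must agree on the rest.
\forall x, d_1, d_2 \, \big( \mathit{Employee}(x, d_1) \land
    \mathit{Employee}(x, d_2) \rightarrow d_1 = d_2 \big)
```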
We use a non-conflicting class of conceptual dependencies that guarantees the decidability of query answering; non-conflicting dependencies avoid interaction between the TGDs and the EGDs. Our semantics extends the existing semantics defined for extended entity relationship models with the notions of fact, dimension category, dimensional hierarchy, and dimension attribute.
In addition, we define a class of conceptual queries and prove that it is decidable.
The DL-Lite logic has been used extensively for query rewriting, as it reduces the data complexity of query answering to AC0. Moreover, we present a query rewriting algorithm for the defined class of conceptual dependencies.
Finally, we consider the problem in light of the GAV and LAV approaches and establish the complexity of query answering. The query answering problem becomes decidable if we add certain constraints to a well-known set of EGDs + TGDs to guarantee summarizability. Under the global-as-view approach to mapping, query answering has AC0 data complexity and EXPTIME combined complexity; under the LAV approach it becomes coNP-hard.
563.
Bezpečnostní aspekty implementace databázových systémů / Security aspects of database systems implementation. Pokorný, Tomáš. January 2009.
The aim of this thesis is to provide a comprehensive overview of database systems security. The reader is introduced to the basics of information security and its development. The following chapter defines the concept of database system security using the ISO/IEC 27000 standard; the findings from this chapter form a comprehensive list of database security requirements. One chapter also deals with the legal aspects of this domain. The second part of the thesis compares four object-relational database systems: Oracle, IBM DB2, Microsoft SQL Server, and PostgreSQL. The comparison criteria are based on the list of database security requirements and reflect the specific attributes of this type of data model. The comparison addresses the possible uses of each database product as well as its limitations.
564.
Právní ochrana databází / Legal protection of databases. Šlajerová, Martina. January 2008.
The rising importance of technological development, especially in Europe and the USA, which are the largest producers of databases, requires internationally unified regulation for the protection of databases. The aim of this thesis is to present these issues and highlight the shortcomings of the current and proposed legislation, in order to determine what adequate legal protection of databases is and how it can be achieved. The first chapter provides an overview of the definitions and basic concepts, including the protection of databases by copyright and by the sui generis right, together with a list of criteria for establishing adequate legal protection. The second chapter outlines the legal protection of databases in the European Community, including the current legal system, its benefits and drawbacks, and alternatives to sui generis protection; theories and current judicial practice are presented to further clarify the issue. The third chapter deals with the legal protection of databases in the U.S., which in many ways differs from the European legislation; both copyright and alternative forms of database protection are presented in terms of legislation, case law, and the relevant theories. The fourth and last chapter focuses on the international regulation of databases and potential changes to it, together with the theories and rulings in this area. Finally, the important points of the thesis are summarized, and adequate legal measures for the protection of databases are suggested based on the advantages and disadvantages of the current forms of protection and judicial practice.
565.
A methodology for database management of time-variant encodings and/or missing information. Threlfall, William John. January 1988.
The problem presented is how to handle encoded data for which the encodings or decodings change over time, and which contains codes indicating that certain data is unknown, invalid, or not applicable with respect to certain entities during certain time periods.
It is desirable to build a database management system that knows about, and can handle, changes in encodings and missing-information codes by embedding such knowledge in the data definition structure, so that application programmers and users need not constantly worry about how the data is encoded.
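The abstract does not show DEFINE itself; the sketch below illustrates in Python, with invented field names, what embedding such knowledge in the data definition can look like. Each field carries epoch-stamped code tables plus reserved missing-information codes, and decoding is always done relative to the date a value was recorded:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical code table for a "marital status" field whose encodings
# changed in 1980, plus reserved codes for missing information.
@dataclass(frozen=True)
class CodeEpoch:
    start: date   # first day this encoding is in force
    codes: dict   # code -> meaning

MISSING = {9: "unknown", 8: "not applicable"}   # missing-information codes
EPOCHS = [
    CodeEpoch(date(1900, 1, 1), {1: "single", 2: "married"}),
    CodeEpoch(date(1980, 1, 1), {1: "single", 2: "married", 3: "divorced"}),
]

def decode(code: int, when: date) -> str:
    """Decode a stored value relative to the date it was recorded."""
    if code in MISSING:
        return MISSING[code]
    # Pick the latest epoch already in force at the recording date.
    epoch = max((e for e in EPOCHS if e.start <= when), key=lambda e: e.start)
    return epoch.codes.get(code, "invalid")

print(decode(3, date(1975, 6, 1)))  # "invalid": code 3 did not exist yet
print(decode(3, date(1985, 6, 1)))  # "divorced"
print(decode(9, date(1985, 6, 1)))  # "unknown"
```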
The experimental database management language DEFINE is used to achieve the desired result, and a database structure is created for a real-life example of data containing many instances of time-variant encodings and missing information.
566.
Query languages for relational data base management systems. Jervis, Brian. January 1974.
A new database-independent query language for relational systems is presented. Queries in this language specify only properties of the data to be retrieved. An algorithm for reducing queries to a response relation is described; this reduction algorithm uses Micro-Planner to decide which relations in the database are applicable to the query and how those relations should be manipulated. A semantic model is used as the basis for this work. The query language is also compared with existing languages.
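The abstract gives no concrete syntax, but a query that specifies only properties of the desired data is in the spirit of the relational calculus; a hypothetical example (relation and attribute names invented, not the thesis's notation) might read:

```latex
% "Retrieve the names of suppliers who supply some red part",
% stated purely as a property of the data to be retrieved.
\{\, s.\mathit{name} \mid \mathit{Supplier}(s) \land \exists p \,
   \big( \mathit{Part}(p) \land p.\mathit{colour} = \text{red} \land
         \mathit{Supplies}(s, p) \big) \,\}
```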
567.
A survey of the data administration function in large Canadian organizations. McCririck, Ian Bryce. January 1979.
The object of this study was to survey large Canadian organizations in order to:
1) determine the extent to which these organizations have established a separate Data Administration function,
2) empirically test Nolan's Stage Model of EDP Growth as a predictor of a separate Data Administration function, and
3) survey the characteristics of the Data Administration function in those organizations that have formally established such a speciality.
A survey package containing two questionnaires was sent to 555 large Canadian organizations in the private and public sectors. The "EDP Profile Questionnaire" was directed to the manager of the EDP activity in the surveyed organizations; it is concerned with the EDP growth process and the existence of a Data Administrator. The "Data Administration Questionnaire" was directed to the Data Administrator in the surveyed organizations; it is concerned with the characteristics and responsibilities of the Data Administration function. Analysis was performed on 254 EDP functions and 69 Data Administration functions. The results indicate that the Data Administration function is not prevalent in large Canadian organizations; where the function does exist, its role is a fairly minor one within the EDP activity. This study found that organizations with very large EDP activities and many years of experience with computers were more likely to have established a Data Administration function than smaller and less experienced ones. Certain organizational types (those with discretionary funds available) were more likely to have a Data Administration function than other types. The "maturity" of the organization's EDP activity was not found to be a good predictor of the existence of a Data Administrator. The sampled Data Administration functions exhibited a wide dispersion in both the activities performed and the amount of time spent on each. Few policy-setting activities were performed by the Data Administrators. The Data Administration function appeared to be focused on those "data bases" using a Data Base Management System. Organizational conflicts and a general misunderstanding of the function by EDP management have likely held back the development of the function beyond one involved primarily with the support of DBMS application systems. Future research should be directed at understanding these conflicts and misperceptions through an analysis of the decision process involved in establishing the Data Administration function. An attempt should be made to more fully understand the data resource and how it might differ among organizational types. Before further use is made of Nolan's stage growth model, serious thought should be given to determining in more precise terms what the EDP growth process variables are and how they might best be measured.
568.
Automating physical reorganizational requirements at the access path level of a relational database management system. Weddell, Grant Edwin. January 1980.
Any design of the access path level of a database management system must make allowance for physical reorganization requirements. The facilities provided for such requirements at the access path level have so far been primitive in nature, almost always requiring complicated human intervention. This thesis begins to explore the notion of increasing the degree of automation of such requirements at the access path level, and to consider the practical basis for self-adapting, or self-organizing, data management systems. Consideration is first given to the motivation (justification) for such automation. Then, based on a review of the relevant aspects of a number of existing data management systems, we present a complete design specification and outline for a proposed access path level. For this system we consider in detail the automation of two major aspects of physical organization: the clustering of records on mass storage media and the selection of secondary indices. The results of our analysis of these problems provide a basis for the ultimate demonstration of the feasibility of such automation.
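The thesis's own selection procedure is not described in the abstract. As a toy stand-in, secondary index selection is often framed as choosing, under a budget, the columns that most benefit an observed query workload; a minimal greedy sketch of that framing (invented workload, not the thesis's algorithm):

```python
from collections import Counter

# Hypothetical observed queries, reduced to the columns in their predicates.
workload = [
    {"cols": ["dept"]}, {"cols": ["dept", "salary"]},
    {"cols": ["hired"]}, {"cols": ["dept"]},
]
BUDGET = 2  # maximum number of secondary indices we are willing to maintain

# Greedily index the columns that appear most often in the workload.
freq = Counter(c for q in workload for c in q["cols"])
chosen = [col for col, _ in freq.most_common(BUDGET)]
print("index these columns:", chosen)   # ['dept', 'salary']
```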
569.
Toward a Data-Type-Based Real Time Geospatial Data Stream Management System. Zhang, Chengyang. 05 1900.
The advent of sensing and communication technologies enables the generation and consumption of large volumes of streaming data. Many of these data streams are geo-referenced. Existing spatio-temporal databases and data stream management systems are not capable of handling real time queries on spatial extents. In this thesis, I investigate several fundamental research issues toward building a data-type-based real time geospatial data stream management system. The thesis makes contributions in the following areas: geo-stream data models, aggregation, window-based nearest neighbor operators, and query optimization strategies. The proposed geo-stream data model is based on second-order logic and multi-typed algebra; both abstract and discrete data models are proposed and exemplified. I further propose two useful geo-stream operators, namely Region By and WNN (window-based nearest neighbor), which abstract common aggregation and nearest neighbor queries as generalized data model constructs. Finally, I propose three query optimization algorithms based on the spatial, temporal, and spatio-temporal constraints of geo-streams. I show the effectiveness of the data model through many query examples, and validate the effectiveness and efficiency of the algorithms through extensive experiments on both synthetic and real data sets. This work establishes the fundamental building blocks of a full-fledged geo-stream database management system and has potential impact in applications such as hazardous weather alerting and monitoring, traffic analysis, and environmental modeling.
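WNN is only named in the abstract; the following sketch (an illustration under assumed semantics, not the thesis's operator) shows the basic shape of a window-based nearest-neighbor operator: keep the last `window` points of the stream and answer k-NN queries against that window:

```python
import heapq
import math
from collections import deque

class WNN:
    """Sliding-window k-nearest-neighbor over a stream of (t, x, y) points."""
    def __init__(self, window: int, k: int):
        self.window, self.k = window, k
        self.points = deque()

    def insert(self, t, x, y):
        self.points.append((t, x, y))
        while len(self.points) > self.window:
            self.points.popleft()        # expire the oldest point

    def query(self, qx, qy):
        # k points in the current window closest to the query location.
        return heapq.nsmallest(
            self.k, self.points,
            key=lambda p: math.hypot(p[1] - qx, p[2] - qy))

op = WNN(window=3, k=2)
for t, x, y in [(1, 0, 0), (2, 5, 5), (3, 1, 1), (4, 6, 6)]:
    op.insert(t, x, y)
print(op.query(0, 0))   # two nearest points among the last three arrivals
```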
570.
System design of a discrepancy reporting system. Pilewski, Frank Michael. 30 March 2010.
Master of Science