1 |
Strategy and methodology for enterprise data warehouse development: integrating data mining and social networking techniques for identifying different communities within the data warehouse
Rifaie, Mohammad, January 2010
Data warehouse technology has been successfully integrated into the information infrastructure of major organizations as a potential solution for eliminating redundancy and providing comprehensive data integration. Realizing the importance of a data warehouse as the main data repository within an organization, this dissertation addresses different aspects of data warehouse architecture and performance.
Many data warehouse architectures have been presented by industry analysts and research organizations. These architectures vary from independent, physical, business-unit-centric data marts to the centralised two-tier hub-and-spoke data warehouse. The operational data store is a third tier which was offered later to address the business requirements for inter-day data loading. While the industry-available architectures are all valid, I found them to be suboptimal in efficiency (cost) and effectiveness (productivity).
In this dissertation, I advocate a new architecture (the Hybrid Architecture) which encompasses the industry-advocated architectures. The Hybrid Architecture demands the acquisition, loading and consolidation of enterprise atomic and detailed data into a single integrated enterprise data store (the Enterprise Data Warehouse), where business-unit-centric Data Marts and Operational Data Stores (ODS) are built in the same instance of the Enterprise Data Warehouse.
To highlight the role of data warehouses in different applications, we describe an effort to develop a data warehouse for a geographical information system (GIS). We further study the importance of data practices, quality and governance for financial institutions by commenting on the RBC Financial Group case.
The development and deployment of the Enterprise Data Warehouse based on the Hybrid Architecture spawned its own issues and challenges. Organic data growth and business requirements to load additional new data will significantly increase the amount of stored data. Consequently, the number of users will also increase significantly. Enterprise data warehouse obesity, performance degradation and navigation difficulties are chief amongst these issues and challenges.
Association rule mining and social networks have been adopted in this thesis to address the above-mentioned issues and challenges. We describe an approach that uses frequent pattern mining and social network techniques to discover different communities within the data warehouse. These communities include sets of tables frequently accessed together, sets of tables retrieved together most of the time, and sets of attributes that mostly appear together in the queries. We concentrate on tables in the discussion; however, the model is general enough to discover other communities. We first build a frequent pattern mining model by considering each query as a transaction and the tables as items. Then, we mine closed frequent itemsets of tables; these itemsets include tables that are mostly accessed together and hence should be treated as one unit in storage and retrieval for better overall performance. We utilize social network construction and analysis to find maximum-sized sets of related tables; this is a more robust approach than taking a union of overlapping itemsets. We derive the Jaccard distance between the closed itemsets and construct the social network of tables by adding links that represent distance above a given threshold. The constructed network is analyzed to discover communities of tables that are mostly accessed together. The reported test results are promising and demonstrate the applicability and effectiveness of the developed approach.
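
The pipeline outlined in the abstract (queries as transactions, closed frequent itemsets of tables, a Jaccard-based network, then communities) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The function names, the toy query log and the thresholds are all hypothetical, and the itemsets are mined by brute force, which is only practical for tiny inputs. Two further assumptions are made: itemsets are linked when their Jaccard similarity (one minus the Jaccard distance) reaches a threshold, and connected components stand in for whatever community analysis the thesis actually applies.

```python
from itertools import combinations


def closed_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for closed frequent itemsets (brute force)."""
    support = {}
    for tables in transactions:
        items = sorted(set(tables))
        # Count every non-empty subset of the tables used by this query.
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                support[key] = support.get(key, 0) + 1
    frequent = {s: c for s, c in support.items() if c >= min_support}
    # An itemset is closed if no proper superset has the same support.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < t and c == frequent[t] for t in frequent)
    }


def jaccard_similarity(a, b):
    """|a & b| / |a | b|; the Jaccard distance is 1 minus this value."""
    return len(a & b) / len(a | b)


def table_communities(itemsets, min_similarity):
    """Link similar itemsets and merge each connected component's tables."""
    itemsets = list(itemsets)
    parent = list(range(len(itemsets)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(itemsets)), 2):
        # Assumption: an edge means "similar enough", not "distant enough".
        if jaccard_similarity(itemsets[i], itemsets[j]) >= min_similarity:
            parent[find(i)] = find(j)

    groups = {}
    for i, tables in enumerate(itemsets):
        groups.setdefault(find(i), set()).update(tables)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical query log: each entry lists the tables one query touched.
queries = [
    ["customer", "account", "txn"],
    ["customer", "account"],
    ["customer", "account", "txn"],
    ["product", "store"],
    ["product", "store", "sales"],
    ["product", "store", "sales"],
]

closed = closed_frequent_itemsets(queries, min_support=2)
print(table_communities(closed, min_similarity=0.5))
```

On this toy log the sketch groups the customer, account and txn tables into one community and the product, store and sales tables into another. Raising `min_similarity` splits the communities back into the individual closed itemsets.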
|