1051 |
A Personalized Smart Cube for Faster and Reliable Access to Data -- Antwi, Daniel K. 02 December 2013 (has links)
Organizations own data sources that contain millions, billions, or even trillions of rows,
and these data are usually highly dimensional in nature. Typically, these raw repositories
are comprised of numerous independent data sources that are too big to be copied or
joined, with the consequence that aggregations become highly problematic. Data cubes
play an essential role in facilitating fast Online Analytical Processing (OLAP) in many
multi-dimensional data warehouses. Current data cube computation techniques have
had some success in addressing the above-mentioned aggregation problem. However,
the combined problem of reducing data cube size for very large and highly dimensional
databases, while guaranteeing fast query response times, has received less attention.
Another issue is that most OLAP tools often cause users to become lost in an ocean of
data while performing data analysis, even though most users are interested in only a
subset of the data. Consider, for example, a business manager who wants to answer a
crucial location-related business question: "Why are my sales declining at location X?"
This manager wants fast, unambiguous, location-aware answers to his queries. He
requires access to only the relevant filtered information, as found in the attributes
that are directly correlated with his current needs. Therefore, it is important to
determine and extract only the small data subset that is highly relevant from a
particular user's location and perspective.
In this thesis, we present the Personalized Smart Cube approach to address the above-mentioned scenario. Our approach consists of two main parts. Firstly, we combine
vertical partitioning, partial materialization and dynamic computation to drastically
reduce the size of the computed data cube while guaranteeing fast query response times.
Secondly, our personalization algorithm dynamically monitors user query patterns and
creates a personalized data cube for each user. This ensures that users utilize only that
small subset of data that is most relevant to them.
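To make these two parts concrete, the following minimal Python sketch (hypothetical names and a simplified design, not the actual implementation from this thesis) monitors which dimension combinations a user queries and materializes only that user's most frequently requested views, answering everything else by dynamic computation:

```python
from collections import Counter

class PersonalizedCubeSketch:
    """Illustrative sketch of a personalized, partially materialized cube:
    track a user's query patterns and keep only the most frequently
    requested views materialized (hypothetical design, not the thesis code)."""

    def __init__(self, max_materialized=8):
        self.max_materialized = max_materialized
        self.view_frequency = Counter()   # how often each set of dimensions is queried
        self.materialized = {}            # view -> precomputed aggregate

    def answer(self, fact_rows, dimensions, measure):
        view = tuple(sorted(dimensions))          # a view is identified by its grouping dimensions
        self.view_frequency[view] += 1            # monitor the user's query pattern
        if view in self.materialized:
            return self.materialized[view]        # fast path: reuse the materialized view
        result = self._aggregate(fact_rows, view, measure)   # dynamic computation
        hottest = {v for v, _ in self.view_frequency.most_common(self.max_materialized)}
        if view in hottest:                       # partial, personalized materialization
            self.materialized[view] = result
        return result

    @staticmethod
    def _aggregate(fact_rows, view, measure):
        totals = Counter()
        for row in fact_rows:
            totals[tuple(row[d] for d in view)] += row[measure]
        return dict(totals)

# Usage with toy fact rows:
cube = PersonalizedCubeSketch()
facts = [{"location": "X", "month": "Jan", "sales": 120.0},
         {"location": "X", "month": "Feb", "sales": 90.0},
         {"location": "Y", "month": "Jan", "sales": 200.0}]
print(cube.answer(facts, ["location"], "sales"))   # {('X',): 210.0, ('Y',): 200.0}
```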
Our experimental evaluation of our Personalized Smart Cube approach showed that
our work compared favorably with other state-of-the-art methods. We evaluated our
work focusing on three main criteria, namely the storage space used, query response
time and the cost savings ratio of using a personalized cube. The results showed that our
algorithm materializes a relatively smaller number of views than other techniques and
also compares favorably in terms of query response time. Further, our personalization
algorithm is superior to the state-of-the-art Virtual Cube algorithm when evaluated
in terms of the number of user queries that were successfully answered when using a
personalized cube, instead of the base cube.
|
1052 |
Att Strukturera Data Med Nyckelord: Utvecklandet av en Skrapande Artefakt -- Bramell, Fredrik, From, From January 2022 (has links)
Development of different methods for processing information has long been a central area in computer science. Being able to structure and compile different types of information can streamline many tasks and facilitate various assignments. In addition, the web keeps growing, and as a result larger amounts of information become more accessible; this also means that it can be more difficult to find and compile relevant information. This raises the questions: Is a layered architecture suitable for extracting semi-structured data from various web-based documents, such as HTML and PDF, and structuring the content as generically as possible? And how can semi-structured data be found in various forms of documents on the web, based on keywords, so that the data can be saved in tabular form? A review of previous research shows a gap when it comes to processing different levels of structure with the web as a source of data. When processing data, previous projects have usually used a layered architecture in which each layer has a specific task, and this is also the architecture chosen for this artifact. To create the artifact, the Design and Creation method is applied, together with a literature study. This method is common in work whose goal is to create an artifact in order to answer research questions. Tests of the artifact are also performed as part of this method, showing how well the artifact follows the instructions and whether or not it can answer the research questions. This work has resulted in an artifact that works well and lays a foundation for future work. However, there is room for improvement, such as enabling the artifact to understand context and find more relevant information, as well as future research on how other software could be implemented to streamline and improve the results.
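As a rough illustration of such a layered architecture (a hedged sketch only: the artifact's actual layers, libraries, and PDF handling are not described in detail here, and the function names below are hypothetical), the snippet uses the widely available requests, BeautifulSoup, and pandas libraries to fetch an HTML document, extract candidate table rows, keep the rows matching a keyword, and save the result in tabular form:

```python
# Sketch of a layered scraper (hypothetical names, HTML only; PDF handling omitted).
import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_layer(url):
    """Retrieval layer: download the raw HTML document."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def extract_layer(html):
    """Extraction layer: pull semi-structured rows out of HTML tables."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
            if cells:
                rows.append(cells)
    return rows

def structure_layer(rows, keyword):
    """Structuring layer: keep rows mentioning the keyword, return a table."""
    matching = [r for r in rows if any(keyword.lower() in c.lower() for c in r)]
    return pd.DataFrame(matching)

if __name__ == "__main__":
    html = fetch_layer("https://example.com/report")   # placeholder URL
    table = structure_layer(extract_layer(html), keyword="revenue")
    table.to_csv("results.csv", index=False)           # tabular output
```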
|
1054 |
Exploring Strategies for Implementing Data Governance Practices -- Cave, Ashley 01 January 2017 (has links)
Data governance reaches across the field of information technology and is increasingly important for big data efforts, regulatory compliance, and ensuring data integrity. The purpose of this qualitative case study was to explore strategies for implementing data governance practices. This study was guided by institutional theory as the conceptual framework. The study's population consisted of informatics specialists from a small hospital, which is also a research institution, in the Washington, DC, metropolitan area. This study's data collection included semistructured, in-depth individual interviews (n = 10), focus groups (n = 3), and the analysis of organizational documents (n = 19). By using methodological triangulation and by member checking with interviewees and focus group members, efforts were made to increase the validity of this study's findings. Through thematic analysis, 5 major themes emerged from the study: structured oversight with committees and boards, effective and strategic communications, compliance with regulations, obtaining stakeholder buy-in, and benchmarking and standardization. The results of this study may help informatics specialists better strategize future implementations of data governance and information management practices. By implementing effective data governance practices, organizations will be able to successfully manage and govern their data. These findings may contribute to social change by ensuring better protection of protected health information and personally identifiable information.
|
1055 |
Master Data Integration hub - řešení pro konsolidaci referenčních dat v podniku / Master Data Integration hub - solution for company-wide consolidation of referrential data -- Bartoš, Jan January 2011 (has links)
In current information systems, the requirement to integrate disparate applications into a cohesive package is strongly emphasized. While well-established technologies facilitating functional and communication integration (ESB, message brokers, web services) already exist, tools and methodologies for the continuous integration of disparate data sources at the enterprise-wide level are still in development. Master Data Management (MDM) is a major approach in the area of data integration and referential data management in particular. It encompasses referential data integration, data quality management, referential data consolidation, metadata management, master data ownership, the principle of accountability for master data, and the processes related to referential data management. The thesis focuses on the technological aspects of an MDM implementation realized by introducing a centralized repository for master data -- the Master Data Integration Hub (MDI Hub). The MDI Hub is an application that enables the integration and consolidation of referential data stored in disparate systems and applications based on predefined workflows. It also handles the propagation of master data back to the source systems and provides services such as dictionary management and data quality monitoring. The objective of the thesis is to cover the design and implementation aspects of the MDI Hub, which forms the application part of MDM. The introduction discusses the motivation for referential data consolidation and presents the techniques used in developing the MDI Hub solution. The main part of the thesis proposes the design of an MDI Hub reference architecture and suggests the activities performed in the process of an MDI Hub implementation. The thesis is based on information gained from specialized publications, on knowledge gathered by delivering projects with the companies Adastra and Ataccama, and on co-workers' know-how and experience. The most important contribution of the thesis is a comprehensive view of MDI Hub design and the proposal of an MDI Hub reference architecture, which can serve as a basis for a particular MDI Hub implementation.
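As a rough, simplified illustration of the kind of consolidation workflow an MDI Hub performs (a sketch with hypothetical field names and a naive survivorship rule, not the architecture proposed in this thesis), the snippet below merges customer records from two source systems on a shared business key and records the lineage of each consolidated value so it could later be propagated back:

```python
# Simplified consolidation step of a master-data hub (hypothetical schema).
from datetime import date

def consolidate(records):
    """Group source records by business key and pick the most recently
    updated value for each attribute (a naive survivorship rule)."""
    master = {}
    for rec in records:
        key = rec["customer_id"]                      # shared business key
        golden = master.setdefault(key, {"customer_id": key, "_lineage": {}})
        for attr in ("name", "email", "city"):
            current = golden["_lineage"].get(attr)
            if current is None or rec["updated"] > current["updated"]:
                golden[attr] = rec[attr]
                golden["_lineage"][attr] = {"source": rec["source"],
                                            "updated": rec["updated"]}
    return master

source_records = [
    {"source": "CRM", "customer_id": 42, "name": "Jan Novak",
     "email": "jan@old.example", "city": "Praha", "updated": date(2011, 1, 5)},
    {"source": "ERP", "customer_id": 42, "name": "Jan Novák",
     "email": "jan@new.example", "city": "Praha", "updated": date(2011, 3, 1)},
]
golden_records = consolidate(source_records)
print(golden_records[42]["email"])   # -> jan@new.example (latest value wins)
```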
|
1056 |
Otevřená data veřejné správy / Open Government Data -- Kučera, Jan January 2010 (has links)
This Ph.D. thesis deals with Open Government Data and a methodology for the publication of this kind of data. Public sector bodies hold a significant amount of data that can be reused in innovative ways, leading to the development of new products and services. According to the Open Knowledge Foundation, "Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike." Publication and reuse of Open Government Data can lead to benefits such as increased economic growth. The state, society, and public sector bodies themselves can benefit from Open Government Data. However, public sector bodies currently face a number of problems when publishing Open Government Data; for example, regular updates of the published datasets are not always ensured, and different public sector bodies apply different approaches to publication. The main goal of this thesis is to design the Open Government Data Publication Methodology, which should address the current problems related to the publication of Open Government Data.
|
1057 |
Applying the phi ratio in designing a musical scale -- Smit, Konrad van Zyl 03 1900 (has links)
Thesis (MMus (Music))--University of Stellenbosch, 2005. / In this thesis, an attempt is made to create an aesthetically pleasing musical scale based on the ratio of
phi. Precedents for the application of phi in aesthetic fields exist; noteworthy are Le Corbusier's
architectural works, the measurements of which are based on phi.
A brief discussion of the unique mathematical properties of phi is given, followed by a discussion of the
manifestations of phi in the physical ratios as they appear in animal and plant life.
Specific scales which have found an application in art music are discussed, and the properties to which
their success is attributable are identified. Consequently, during the design of the phi scale, these
characteristics are incorporated. The design of the phi scale is facilitated by the use of the most
sophisticated modern computer software in the field of psychoacoustics.
During the scale’s design process, particular emphasis is placed on the requirement of obtaining maximal
sensory consonance. For this reason, an in-depth discussion of the theories regarding consonance
perception is undertaken.
During this discussion, the reader’s attention is drawn to the difference between musical and perceptual
consonance, and a discussion of the developmental history of musical consonance is given.
Lastly, the scale is tested to see whether it complies with the requirements for successful scales.
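As a purely illustrative aside (an assumption-laden sketch, not necessarily the construction chosen in this thesis), the following snippet shows one naive way a scale could be derived from the phi ratio, by stacking phi intervals above a reference pitch and folding them back into a single octave:

```python
# Purely illustrative: one naive way to derive a scale from the phi ratio.
PHI = (1 + 5 ** 0.5) / 2          # the golden ratio, approximately 1.618

def phi_scale(base_hz=440.0, steps=7):
    """Stack intervals of ratio phi above base_hz and fold each pitch
    back into the octave [base_hz, 2 * base_hz)."""
    pitches = []
    for k in range(steps):
        freq = base_hz * PHI ** k
        while freq >= 2 * base_hz:    # octave reduction
            freq /= 2
        pitches.append(round(freq, 2))
    return sorted(pitches)

print(phi_scale())   # seven pitches between 440 Hz and 880 Hz
```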
|
1058 |
Geometric Methods for Mining Large and Possibly Private Datasets -- Chen, Keke 07 July 2006 (has links)
With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. The proposed research aims at exploring the geometric properties of the multidimensional datasets utilized in statistical learning and data mining, and at providing novel techniques and frameworks for mining very large datasets while protecting the desired data privacy.
The first main contribution of this research is the development of the iVIBRATE interactive visualization-based approach for clustering very large datasets. The iVIBRATE framework uniquely addresses the challenges of handling irregularly shaped clusters, domain-specific cluster definition, and cluster labeling of the data on disk. It consists of the VISTA visual cluster rendering subsystem and the Adaptive ClusterMap Labeling subsystem.
The second main contribution is the development of the 'Best K Plot' (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method uniquely addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method consists of the basic theory, the sample BKPlot theory for large datasets, and the testing method for identifying no-cluster datasets.
The third main contribution of this research is the development of the theory of geometric data perturbation and its application in privacy-preserving data classification involving a single party or multiparty collaboration. The key to geometric data perturbation is to find a good randomly generated rotation matrix and an appropriate noise component that provide a satisfactory balance between the privacy guarantee and data quality, considering possible inference attacks. When geometric perturbation is applied to collaborative multiparty data classification, it is challenging to unify the different geometric perturbations used by different parties. We study three protocols under the data-mining-service-oriented framework for unifying the perturbations: 1) the threshold-satisfied voting protocol, 2) the space adaptation protocol, and 3) the space adaptation protocol with a trusted party. The tradeoffs between the privacy guarantee, the model accuracy, and the cost are studied for the protocols.
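To make the geometric perturbation idea concrete, the following minimal NumPy sketch (an illustration of the general technique, not the specific protocols studied in this research) draws a random orthogonal matrix via QR decomposition, rotates the data with it, and adds a small noise component; pairwise distances, on which many classifiers rely, are largely preserved:

```python
import numpy as np

def geometric_perturbation(X, noise_scale=0.05, seed=0):
    """Rotate the d-dimensional rows of X with a random orthogonal matrix
    and add Gaussian noise (illustrative sketch of the general idea)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # QR decomposition of a Gaussian matrix yields a random orthogonal Q.
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    noise = rng.normal(scale=noise_scale, size=X.shape)
    return X @ Q.T + noise, Q

# Pairwise distances are largely preserved under rotation plus small noise:
X = np.array([[1.0, 2.0, 3.0], [2.0, 0.5, 1.0], [0.0, 1.0, 4.0]])
X_pert, Q = geometric_perturbation(X)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_pert[0] - X_pert[1]))
```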
|
1059 |
Genome-wide analyses of single cell phenotypes using cell microarrays -- Narayanaswamy, Rammohan, 1978- 29 August 2008 (has links)
The past few decades have witnessed a revolution in recombinant DNA and nucleic acid sequencing technologies. Recently, however, technologies capable of massively high-throughput, genome-wide data collection, combined with computational and statistical tools for data mining, integration and modeling, have enabled the construction of predictive networks that capture cellular regulatory states, paving the way for 'systems biology'. Consequently, protein interactions can be captured in the context of a cellular interaction network, and emergent 'system' properties can be arrived at that may not have been accessible through conventional biology. The ability to generate data from multiple, non-redundant experimental sources is one of the important facets of systems biology. Towards this end, we have established a novel platform called 'spotted cell microarrays' for conducting image-based genetic screens. We have subsequently used spotted cell microarrays for studying multidimensional phenotypes in yeast under different regulatory states. In particular, we studied the response to mating pheromone using a cell microarray comprising the yeast non-essential deletion library, and analyzed morphology changes to identify novel genes involved in mating. An important aspect of the mating response pathway is the large-scale spatiotemporal change to the proteome, an aspect of proteomics that is still largely obscure. In our next study, we used an imaging screen and a computational approach to predict and validate the complement of proteins that polarize and change localization towards the mating projection tip. By adopting such hybrid approaches, we have been able to study not only proteins involved in specific pathways but also their behavior in a systemic context, leading to a broader comprehension of cell function. Lastly, we have performed a novel metabolic starvation-based screen using the GFP-tagged collection to study proteome dynamics in response to nutrient limitation, and are currently in the process of rationalizing our observations through follow-up experiments. We believe this study has implications for evolutionarily conserved cellular mechanisms such as protein turnover, quiescence and aging. Our technique has therefore been applied to address several interesting aspects of yeast cellular physiology and behavior and is now being extended to mammalian cells.
|
1060 |
Big Data Validation -- Rizk, Raya January 2018 (has links)
With the explosion in usage of big data, stakes are high for companies to develop workflows that translate the data into business value. Those data transformations are continuously updated and refined in order to meet the evolving business needs, and it is imperative to ensure that a new version of a workflow still produces the correct output. This study focuses on the validation of big data in a real-world scenario, and implements a validation tool that compares two databases that hold the results produced by different versions of a workflow in order to detect and prevent potential unwanted alterations, with row-based and column-based statistics being used to validate the two versions. The tool was shown to provide accurate results in test scenarios, providing leverage to companies that need to validate the outputs of the workflows. In addition, by automating this process, the risk of human error is eliminated, and it has the added benefit of improved speed compared to the more labour-intensive manual alternative. All this allows for a more agile way of performing updates on the data transformation workflows by improving on the turnaround time of the validation process.
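A toy version of such a comparison (a sketch using pandas, with hypothetical table and column names; the actual tool's checks are not reproduced here) might compute row counts and simple per-column statistics for the outputs of two workflow versions and report where they diverge:

```python
import pandas as pd

def validate_versions(old: pd.DataFrame, new: pd.DataFrame, tolerance=1e-9):
    """Compare two workflow outputs with row-based and column-based checks."""
    report = {"row_count_old": len(old), "row_count_new": len(new),
              "row_count_match": len(old) == len(new), "column_mismatches": []}
    for col in old.columns.intersection(new.columns):
        if pd.api.types.is_numeric_dtype(old[col]):
            diverged = abs(old[col].mean() - new[col].mean()) > tolerance
        else:
            diverged = set(old[col].unique()) != set(new[col].unique())
        if diverged or old[col].isna().sum() != new[col].isna().sum():
            report["column_mismatches"].append(col)
    return report

# Example: version 2 of the workflow accidentally drops a row.
v1 = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
v2 = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
print(validate_versions(v1, v2))
```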
|