1031

Mining Truth Tables and Straddling Biclusters in Binary Datasets

Owens, Clifford Conley 07 January 2010 (has links)
As the world swims deeper into a deluge of data, binary datasets relating objects to properties can be found in many different fields. Such datasets abound in practically any area of interest, including biology, politics, entertainment, and education. This explosion calls for the definition of new types of patterns in binary data, as well as algorithms to find these patterns efficiently. In this work, we introduce truth tables as a new class of patterns to be mined in binary datasets. Truth tables represent a subset of properties which exhibit maximal variability (and hence suggest independence) in occurrence patterns over the underlying objects. Unlike other measures of independence, truth tables possess anti-monotone features that can be exploited to mine them effectively. We present a level-wise algorithm that takes advantage of these features, showing results on real and synthetic data. These results demonstrate the scalability of our algorithm. We also introduce new methods of mining straddling biclusters. Biclusters relate subsets of objects to subsets of properties they share within a single dataset. Straddling biclusters extend biclusters by relating a subset of objects to the subsets of properties they share across two datasets. We present two level-wise algorithms, named UnionMiner and TwoMiner, which discover straddling biclusters efficiently by treating multiple datasets as a single dataset. We show results on real and synthetic data, and explore the advantages and limitations of each algorithm. We also develop guidelines suggesting which of these algorithms is likely to perform better based on features of the datasets. / Master of Science
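The abstract gives no implementation details; the sketch below is a minimal, hypothetical illustration of the two ideas it names: a truth-table test (all 2^k value combinations of k binary properties occur) and a level-wise search exploiting the anti-monotone property (every subset of a truth table is itself a truth table). Function names and pruning details are assumptions, not the author's code.

```python
from itertools import combinations

def is_truth_table(rows, cols):
    """True if the projection onto `cols` contains all 2^k distinct
    0/1 combinations, i.e. the properties show maximal variability."""
    seen = {tuple(row[c] for c in cols) for row in rows}
    return len(seen) == 2 ** len(cols)

def mine_truth_tables(rows, max_k):
    """Level-wise search: a candidate is generated only if all of its
    (k-1)-subsets survived the previous level, the same pruning that
    anti-monotonicity licenses in Apriori-style miners."""
    n = len(rows[0])
    frontier = {(c,) for c in range(n) if is_truth_table(rows, (c,))}
    found = set(frontier)
    for k in range(2, max_k + 1):
        candidates = set()
        for cols in frontier:
            for c in range(max(cols) + 1, n):
                cand = cols + (c,)
                if all(sub in frontier for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        frontier = {cand for cand in candidates if is_truth_table(rows, cand)}
        found |= frontier
    return sorted(found)

# Toy data: properties 0 and 1 vary independently; property 2 copies 0.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
print(mine_truth_tables(rows, 2))  # pairs (0, 1) and (1, 2) qualify; (0, 2) does not
```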
1032

Towards a Data Quality Framework for Heterogeneous Data

Micic, Natasha, Neagu, Daniel, Campean, Felician, Habib Zadeh, Esmaeil 22 April 2017 (has links)
Yes / Every industry produces significant data output from its working processes, and with the recent advent of big data mining and integrated data warehousing there is a clear case for a robust methodology for assessing data quality to support sustainable and consistent processing. In this paper we review Data Quality (DQ) work across multiple domains in order to propose connections between their methodologies. This critical review suggests that, when the quality of heterogeneous data sets is assessed, they are seldom treated as distinct types of data requiring an alternate data quality assessment framework. We discuss the need for such a directed DQ framework, outline the opportunities foreseen in this research area, and propose to address the problem through degrees of heterogeneity.
1033

Large-scale data analysis using the Wigner function

Earnshaw, Rae A., Lei, Ci, Li, Jing, Mugassabi, Souad, Vourdas, Apostolos January 2012 (has links)
No / Large-scale data are analysed using the Wigner function. It is shown that the ‘frequency variable’ provides important information, which is lost with other techniques. The method is applied to ‘sentiment analysis’ in data from social networks and also to financial data.
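For reference (the definition is standard and not restated in the abstract), the Wigner function of a signal x(t) in its signal-analysis form, the Wigner-Ville distribution, is

```latex
W_x(t,\omega) = \int_{-\infty}^{\infty}
  x\!\left(t+\tfrac{\tau}{2}\right)\,
  x^{*}\!\left(t-\tfrac{\tau}{2}\right)\,
  e^{-i\omega\tau}\,\mathrm{d}\tau
```

The second argument, ω, is the 'frequency variable' the abstract credits with carrying information that other techniques lose.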
1034

Towards a big data analytics platform with Hadoop/MapReduce framework using simulated patient data of a hospital system

Chrimes, Dillon 28 November 2016 (has links)
Background: Big data analytics (BDA) is important for reducing healthcare costs, but it poses many challenges. The objective of this study was to establish a high-performance, interactive BDA platform for a hospital system. Methods: A Hadoop/MapReduce framework formed the BDA platform, with HBase (a NoSQL database) using hospital-specific metadata and file ingestion. Query performance was tested with Apache tools from the Hadoop ecosystem. Results: At the optimized iteration, Hadoop Distributed File System (HDFS) ingestion required three seconds, but HBase required four to twelve hours to complete the Reducer stage of MapReduce. HBase bulk loads took a week for one billion records (10 TB) and over two months for three billion (30 TB). Simple and complex queries each returned in about two seconds at one and three billion records, respectively. Interpretation: The HBase platform distributed by Hadoop performed successfully at large volumes representing the Province's entire data set. Inconsistencies in MapReduce limited operational efficiency. The importance of Hadoop/MapReduce for representing health informatics data is discussed further. / Graduate / 0566 / 0769 / 0984 / dillon.chrimes@viha.ca
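The abstract names the stack but reproduces no code; as a hedged flavor of the MapReduce layer, below is a minimal Hadoop Streaming mapper/reducer pair that counts encounters per diagnosis code. The CSV column layout is hypothetical and not taken from the study's hospital-specific metadata.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: emit "diagnosis_code\t1" per row,
# assuming (hypothetically) the third CSV field holds the diagnosis code.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) > 2:
        print(f"{fields[2]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so per-key counts accumulate in a single pass.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

The sorted-by-key handoff between the two processes is the contract Hadoop Streaming provides; the pair is submitted with the standard streaming jar and its -mapper/-reducer options.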
1035

Large Data Clustering And Classification Schemes For Data Mining

Babu, T Ravindra 12 1900 (has links)
Data Mining deals with extracting valid, novel, potentially useful, and easily understood abstractions from large data. Data is large when the number of patterns, the number of features per pattern, or both are large, with a size beyond the capacity of a computer's main memory. Data Mining is an interdisciplinary field involving database systems, statistics, machine learning, visualization, and computational aspects. The focus of data mining algorithms is scalability and efficiency. Large data clustering and classification is an important activity in Data Mining. Clustering algorithms are predominantly iterative, requiring multiple scans of the dataset, which is very expensive when data is stored on disk. In the current work we propose different schemes that have both theoretical validity and practical utility in dealing with such large data. The schemes broadly encompass data compaction, classification, prototype selection, use of domain knowledge, and hybrid intelligent systems. The proposed approaches can be broadly classified as (a) compressing the data in a non-lossy manner, then clustering and classifying the patterns in their compressed form directly through a novel algorithm; (b) compressing the data in a lossy fashion such that a very high degree of compression and abstraction is obtained in terms of 'distinct subsequences', and classifying the data in this compressed form to improve prediction accuracy; (c) obtaining simultaneous prototype and feature selection with the help of incremental clustering, a lossy compression scheme, and a rough set approach; (d) demonstrating that prototype selection and data-dependent techniques can reduce the number of comparisons in a multiclass classification scenario using SVMs; and (e) showing that, by making use of domain knowledge of the problem and the data under consideration, we obtain very high classification accuracy with fewer iterations of AdaBoost. The schemes have pragmatic utility. The prototype selection algorithm is incremental, requires a single scan of the dataset, and has linear time and space requirements. We provide results obtained on a large, high-dimensional handwritten (hw) digit dataset. The compression algorithm is based on simple concepts; we demonstrate that classification of the compressed data improves the computation time required by a factor of 5, with prediction accuracy on both compressed and original data being exactly the same, 92.47%. With the proposed lossy compression scheme and pruning methods, we demonstrate that even with a reduction of distinct subsequences by a factor of 6 (690 to 106), prediction accuracy improves: with the original data containing 690 distinct subsequences the classification accuracy is 92.47%, and with an appropriate choice of pruning parameters the number of distinct subsequences reduces to 106 with a corresponding classification accuracy of 92.92%. The best classification accuracy, 93.3%, is obtained with 452 distinct subsequences. With the scheme of simultaneous feature and prototype selection, we improve classification accuracy beyond that obtained with kNNC, viz. 93.58%, while significantly reducing the number of features and prototypes, achieving a compaction of 45.1%.
In the case of hybrid schemes based on SVMs, prototypes, and a domain-knowledge-based tree (KB-Tree), we demonstrate a reduction in SVM training time of 50% and in testing time of about 30% compared to using the complete data, with an improvement of classification accuracy to 94.75%. In the case of AdaBoost the classification accuracy is 94.48%, which is better than that obtained with NNC and kNNC on the entire data; training time is reduced because prototypes are used instead of the complete data. Another important aspect of the work is a KB-Tree (with a maximum depth of 4) that classifies 10-category data in just 4 comparisons. In addition to the hw data, we applied the schemes to Network Intrusion Detection Data (the 10% dataset of KDDCUP99) and demonstrated that the proposed schemes provide lower overall cost than the reported values.
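The prototype selection algorithm is characterized above only by its properties: incremental, one scan of the dataset, linear space. A leader-style clustering pass has exactly that shape; the sketch below illustrates the general idea under that assumption and is not the thesis's actual algorithm (the distance threshold is a made-up parameter).

```python
import numpy as np

def select_prototypes(patterns, threshold):
    """Incremental, single-scan prototype (leader) selection: a pattern
    is absorbed by the first prototype within `threshold`, otherwise it
    becomes a new prototype. Space is linear in the prototypes kept."""
    prototypes = []
    for x in patterns:
        if not any(np.linalg.norm(x - p) <= threshold for p in prototypes):
            prototypes.append(x)
    return prototypes

# The retained prototypes can stand in for the full training set in kNNC.
rng = np.random.default_rng(0)
data = rng.random((1000, 16))
print(len(select_prototypes(data, threshold=0.9)))
```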
1036

Datová kvalita v prostředí otevřených a propojitelných dat / Data quality in the context of open and linked data

Tomčová, Lucie January 2014 (has links)
This master's thesis deals with data quality in the context of open and linked data. One of its goals is to define the specifics of data quality in this context. These specifics are considered mainly with respect to data quality dimensions (i.e., the data characteristics studied in data quality) and the possibilities for measuring them. The thesis also defines the effect on data quality of transforming data to linked data; this effect is defined with consideration of the possible risks and benefits that can influence data quality. For the data quality dimensions considered relevant in the context of open and linked data, a list of metrics is composed and verified on real data (open linked data published by a government institution). The thesis points to the need to recognize the differences specific to this context when assessing and managing data quality. At the same time, it offers possibilities for further study of this question and presents subsequent directions for both theoretical and practical development of the topic.
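As a concrete illustration of what a single entry in such a list of metrics can look like, the sketch below measures a completeness dimension over dictionary-shaped records; the field names and record layout are hypothetical and not taken from the thesis.

```python
def completeness(records, required_fields):
    """Fraction of records in which every required field is present and
    non-empty: one simple, per-dimension data quality metric."""
    if not records:
        return 0.0
    ok = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )
    return ok / len(records)

records = [
    {"title": "Budget 2014", "publisher": "Ministry of Finance"},
    {"title": "Budget 2015", "publisher": ""},
]
print(completeness(records, ["title", "publisher"]))  # 0.5
```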
1037

Framework pro tvorbu generátorů dat / Framework for Data Generators

Kříž, Blažej January 2012 (has links)
This master's thesis is focused on the problem of data generation. At the beginning, it presents several applications for data generation and describes the data generation process. It then deals with the development of a framework for data generators and a demonstration application for validating the framework.
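The framework itself is not reproduced in this record; purely as a hedged illustration of the usual shape of such a design (pluggable field generators behind a common interface, composed into record generators), a minimal sketch:

```python
import random
from abc import ABC, abstractmethod

class Generator(ABC):
    """Common interface every generator in the framework implements."""
    @abstractmethod
    def generate(self):
        ...

class UniformIntGenerator(Generator):
    """Leaf generator: uniform integers in [low, high]."""
    def __init__(self, low, high):
        self.low, self.high = low, high
    def generate(self):
        return random.randint(self.low, self.high)

class RecordGenerator(Generator):
    """Composite generator: builds whole records from field generators."""
    def __init__(self, fields):
        self.fields = fields  # field name -> Generator
    def generate(self):
        return {name: g.generate() for name, g in self.fields.items()}

gen = RecordGenerator({"age": UniformIntGenerator(0, 99),
                       "score": UniformIntGenerator(0, 100)})
print([gen.generate() for _ in range(3)])
```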
1038

A Survey of Methods for Visualizing Spatio-temporal Data

Persson, Mattias January 2020 (has links)
Different kinds of data are generated continuously every second, and in order to analyze this data it has to be transformed into some kind of visual representation. One common type is spatio-temporal data, which is data that exists in both space and time. How to visualize this kind of data has been researched for a long time and is still a very relevant subject to expand on today. A number of approaches are explored in this work, and an extensive literature study has been performed and can be read in this report. The study is divided into different classifications of spatio-temporal data, and the visual representations are structured by these classes. Another contribution of this thesis is a climate data application that visualizes spatio-temporal data sets of temperatures collected for several countries of the world. This application implements several of the visual representations presented in the survey included in this thesis. The result is a four-display application, with each display showing a different aspect of the chosen climate data sets. The result shows how effective multiple linked views are for understanding different characteristics of the data.
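As a small, self-contained taste of the kind of display the application provides (per-country temperature series rendered as linked small multiples), the following matplotlib sketch uses synthetic data; the thesis's real data sets and its four-display design are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the climate data: ten years of monthly mean
# temperatures for three countries (values are invented).
rng = np.random.default_rng(1)
months = np.arange(120)
countries = {"Sweden": 5, "Spain": 15, "India": 25}

fig, axes = plt.subplots(len(countries), 1, sharex=True, figsize=(8, 6))
for ax, (name, base) in zip(axes, countries.items()):
    temps = base + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, 120)
    ax.plot(months, temps)
    ax.set_title(name)
    ax.set_ylabel("°C")
axes[-1].set_xlabel("Month")
fig.tight_layout()  # sharex gives the panels a linked time axis
plt.show()
```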
1039

Navigating the Risks of Dark Data : An Investigation into Personal Safety

Gautam, Anshu January 2023 (has links)
With the exponential proliferation of data, there has been a surge in data generation from diverse sources, including social media platforms, websites, mobile devices, and sensors. However, not all data is readily visible or accessible to the public, leading to the emergence of the concept known as "dark data." This type of data can exist in structured or unstructured formats and can be stored in various repositories, such as databases, log files, and backups. The reasons behind data being classified as "dark" can vary, encompassing factors such as limited awareness, insufficient resources or tools for data analysis, or a perception of irrelevance to current business operations. This research employs a qualitative research methodology incorporating audio/video recordings and personal interviews to gather data, aiming to gain insights into individuals' understanding of the risks associated with dark data and their behaviors concerning the sharing of personal information online. Through the thematic analysis of the collected data, patterns and trends in individuals' risk perceptions regarding dark data become evident. The findings of this study illuminate the multiple dimensions of individuals' risk perceptions and their influence on attitudes towards sharing personal information in online contexts. These insights provide valuable understanding of the factors that shape individuals' decisions concerning data privacy and security in the digital era. By contributing to the existing body of knowledge, this research offers a deeper comprehension of the interplay between dark data risks, individuals' perceptions, and their behaviors pertaining to online information sharing. The implications of this study can inform the development of strategies and interventions aimed at fostering informed decision-making and ensuring personal safety in an increasingly data-centric world.
1040

INTERACTIVE ANALYSIS AND DISPLAY SYSTEM (IADS)

Mattingly, Patrick, Suszek, Eileen, Bretz, James 10 1900 (has links)
International Telemetering Conference Proceedings / October 20-23, 2003 / Riviera Hotel and Convention Center, Las Vegas, Nevada / The Interactive Analysis and Display System (IADS) provides the test engineer with enhanced test-data processing, management and display capabilities necessary to perform critical data monitoring in near real-time during a test mission. The IADS provides enhanced situational awareness through a display capability designed to increase the confidence of the engineer in making clearance decisions within a Mission Control Room (MCR) environment. The engineer achieves this confidence level through IADS’ real-time display capability (every data point) and simultaneous near real-time processing capability consisting of both time and frequency domain analyses. The system displays real-time data while performing interactive and automated near real-time analyses; alerting the engineer when displayed data exceed predefined threshold limits. The IADS provides a post-test capability at the engineer’s project area desktop, with a user interface common with the real-time system. The IADS promotes teamwork by allowing engineers to share data and test results during a mission and in the post-test environment. The IADS was originally developed for the government’s premier flight test programs. IADS has set the standard for MCR advancements in data acquisition and monitoring and is currently being integrated into all the existing MCR disciplines.
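The IADS internals are not shown in this record; the alerting behavior the abstract describes (flag any incoming sample that exceeds predefined threshold limits while the stream is displayed) reduces to something like the sketch below, with hypothetical channel names and limits.

```python
def monitor(stream, limits):
    """Yield (index, channel, value) whenever a sample falls outside
    its channel's predefined (low, high) threshold limits."""
    for i, sample in enumerate(stream):
        for channel, value in sample.items():
            low, high = limits[channel]
            if not (low <= value <= high):
                yield i, channel, value

limits = {"wing_strain": (-50.0, 50.0)}
stream = [{"wing_strain": 12.3}, {"wing_strain": 61.0}]
for alert in monitor(stream, limits):
    print("LIMIT EXCEEDED:", alert)
```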
