Global ETD Search

131	Outlier Detection In Big Data Cao, Lei 29 March 2016 The dissertation focuses on scaling outlier detection to work on both huge static datasets and dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to tactical planning of stock investments. For such mission-critical applications, a timely response is often of paramount importance. Yet processing outlier detection requests has high algorithmic complexity and is resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data, in particular those caused by the high velocity of streaming data, the large volume of static data and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. We first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of the outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP is not only able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets.
In this big data era, as the volume of data advances to new levels, the power of distributed compute clusters must be employed to detect outliers with a short turnaround time. Our approach optimizes the key factors that determine the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional approach of running one detection algorithm on all compute nodes and instead propose a novel multi-tactic methodology that adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools that help analysts quickly extract, interpret and understand the outliers of interest. Our experimental studies, including performance evaluations and user studies conducted on real-world stock, sensor, moving object and geolocation datasets, confirm both the effectiveness and the efficiency of the proposed approaches. big data outlier detection data stream distributed algorithm data analytics
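To make the streaming setting above concrete, here is a minimal sketch of a naive distance-threshold outlier check over a sliding window. It is not LEAP and does not use its optimization principles; it is only the brute-force baseline that such frameworks aim to outperform. The one-dimensional points, the function names and the parameters `window_size`, `radius` and `min_neighbors` are assumptions made for illustration.

```python
from collections import deque


def window_outliers(window, radius, min_neighbors):
    """Return the points in `window` with fewer than `min_neighbors`
    other points within distance `radius` (brute force, O(n^2))."""
    points = list(window)
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if i != j and abs(p - q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


def detect_over_stream(stream, window_size, radius, min_neighbors):
    """Yield (window contents, outliers) each time a full window is available.

    The deque drops the oldest point automatically once `window_size` is reached,
    which models a count-based sliding window that slides by one point.
    """
    window = deque(maxlen=window_size)
    for point in stream:
        window.append(point)
        if len(window) == window_size:
            yield list(window), window_outliers(window, radius, min_neighbors)


if __name__ == "__main__":
    readings = [1.0, 1.1, 0.9, 5.0, 1.05]
    for contents, outliers in detect_over_stream(readings, 4, 0.5, 1):
        print(contents, "->", outliers)
```

Every slide re-examines the whole window, so the cost grows quadratically with the window size; avoiding that repeated work is the kind of saving the abstract attributes to exploiting outlier rarity and temporal relationships among points.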
132	Analisando os dados do programa de melhoramento genético da raça nelore com data warehousing e data mining. / Analyzing the program of genetic improvement of nelore breed data with data warehousing and data mining. Valmir Ferreira Marques 28 October 2002 The database of the Program of Genetic Improvement of Nelore Breed has been growing considerably, so creating an environment that supports the analysis of the Program's data is of fundamental importance. The technologies used to create such an analytical environment are the Data Warehousing and Data Mining processes. In this work, a Data Warehouse and OLAP queries were built to provide multidimensional views of the data. In addition to the analyses carried out with the queries, a Visual Data Mining tool was also used. The resulting analytical environment gives the Program's researchers and cattle breeders greater power to analyze their data. The whole development process of this environment is presented here. business intelligence data mart data mining data warehousing OLAP PMGRN
134	INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA Unknown Date Recent technological developments have led to the rapid production of big data and have also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classification) between the majority and minority classes in big data can skew the predictive performance of classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, when false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection Algorithms Machine learning Big data--Data processing Big data
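The skew described above can be seen in a small worked example. The sketch below scores a classifier that always predicts the majority (negative) class on a made-up dataset with 10 positives and 990 negatives. The numbers and helper names are illustrative assumptions and are not taken from the dissertation.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of actual positives that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 1% minority (positive) class.
y_true = [1] * 10 + [0] * 990
# A degenerate model that always predicts the majority class.
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("recall on minority class:", recall(tp, fn))
```

The model reaches 99% accuracy while missing every positive instance, which is why accuracy alone is a poor yardstick under severe imbalance and why false negatives need separate attention.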
135	NEW METHODS FOR MINING SEQUENTIAL AND TIME SERIES DATA Al-Naymat, Ghazi January 2009 Doctor of Philosophy (PhD) / Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rule mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to mine such data. In spatial data mining, the spatial co-location rule problem differs from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. Therefore, we have proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships – especially negative relationships – in the patterns. The relationships can be obtained from the maximal clique patterns alone, which have never been used until now. Our approach is applied to a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving along paths close to each other for a predefined time.
One approach to processing a “flock query” is to map ST data into high-dimensional space and to reduce the query to a sequence of standard range queries that can be answered using a spatial indexing structure; however, the performance of spatial indexing structures deteriorates rapidly in high-dimensional space. This thesis sets out a preprocessing strategy that uses a random projection to reduce the dimensionality of the transformed space. We use probabilistic arguments to prove the accuracy of the projection, and we present experimental results showing that the curse of dimensionality in an ST setting can be managed by combining random projections with traditional data structures. In time series data mining, we devised a new space-efficient algorithm (SparseDTW) to compute the dynamic time warping (DTW) distance between two time series, which always yields the optimal result. This is in contrast to other approaches, which typically sacrifice optimality to attain space efficiency. The main idea behind our approach is to dynamically exploit the existence of similarity and/or correlation between the time series: the greater the similarity between the time series, the less space is required to compute the DTW distance between them. Other techniques for speeding up DTW impose a priori constraints and do not exploit similarity characteristics that may be present in the data. Our experiments demonstrate that SparseDTW outperforms these approaches. By applying the SparseDTW algorithm, we discover an interesting pattern, “pairs trading”, in a large stock-market dataset of daily index prices from the Australian stock exchange (ASX) from 1980 to 2002. Data mining Spatial data Spatio-temporal Time series data
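For reference, the DTW distance mentioned above is usually defined by a dynamic program over a full cost matrix. The sketch below is that textbook full-matrix version, not SparseDTW; according to the abstract, SparseDTW returns the same optimal value while storing less of the matrix when the series are similar. The absolute difference used as the local cost and the function name `dtw_distance` are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    Time and space are both O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost: absolute difference
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier element of b
                cost[i][j - 1],      # b[j-1] also aligned with an earlier element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # the repeated 2 is absorbed by warping
    print(dtw_distance([0, 0], [1, 1]))
```

The quadratic memory of this table is the cost that space-efficient variants try to reduce.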
136	Using web services for customised data entry Deng, Yanbo January 2007 Scientific databases often need to be accessed from a variety of different applications. There are usually many ways to retrieve and analyse data already in a database. However, it can be more difficult to enter data that was originally stored in different sources and formats (e.g. spreadsheets, other databases, statistical packages). This project investigates a generic, platform-independent way to simplify the loading of databases. The proposed solution uses Web services as middleware to supply essential data management functionality such as inserting, updating, deleting and retrieving data. These functions allow application developers to easily customise their own data entry applications according to local data sources, formats and user requirements. We implemented a Web service to support loading data into the Germinate database at the New Zealand Institute of Crop & Food Research (CFR). We also provided language-specific client toolkits to help developers invoke the Web service. The toolkits allow applications to be easily customised for different platforms. In addition, we developed sample applications to help end users load data from their project data sources via the Web service. The Web service approach was evaluated through user and developer trials. The feedback from the developer trial showed that using Web services as middleware is a useful approach that allows developers and competent end users to customise data entry with minimal effort. More importantly, the customised client applications enabled end users to load data directly from their project spreadsheets and databases, which significantly reduced the effort required for exporting or transforming the source data. data integration data management data loading web services
137	The Discovery and Retrieval of Temporal Rules in Interval Sequence Data Winarko, Edi, edwin@ugm.ac.id January 2007 Data mining is becoming an increasingly important tool for extracting interesting knowledge from large databases. Many industries now use data mining tools to analyse their large collections of databases and to make business decisions. Many data mining problems involve temporal aspects, with examples ranging from engineering to scientific research, finance and medicine. Temporal data mining is an extension of data mining that deals with temporal data. Mining temporal data poses more challenges than mining static data. While the analysis of static data sets often comes down to the question of data items, with temporal data there are many additional possible relations. One of the tasks in temporal data mining is pattern discovery, whose objective is to discover time-dependent correlations, patterns or rules between events in large volumes of data. To date, most temporal pattern discovery research has focused on events existing at a point in time rather than over a temporal interval. In comparison to static rules, mining with respect to time points provides semantically richer rules. However, accommodating temporal intervals offers rules that are richer still. This thesis addresses several issues related to pattern discovery from interval sequence data. Despite its importance, this area of research has received relatively little attention and there are still many issues that need to be addressed. The three main issues that this thesis considers are the definition of what constitutes an interesting pattern in interval sequence data, the efficient mining of patterns in the data, and the identification of interesting patterns among a large number of discovered patterns.
In order to deal with these issues, this thesis formulates the problem of discovering rules, which we term richer temporal association rules, from interval sequence databases. Furthermore, this thesis develops an efficient algorithm, ARMADA, for discovering richer temporal association rules. The algorithm does not require candidate generation. It utilizes a simple index, and only requires at most two database scans. In this thesis, a retrieval system is proposed to facilitate the selection of interesting rules from a set of discovered richer temporal association rules. To this end, a high-level query language specification, TAR-QL, is proposed to specify the criteria of the rules to be retrieved from the rule sets. Three low-level methods are developed to evaluate queries involving rule format conditions. In order to improve the performance of the methods, signature file based indexes are proposed. In addition, this thesis proposes the discovery of inter-transaction relative temporal association rules from event sequence databases. data mining temporal rule interval data sequence data
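The remark above that interval data admit many more relations than point events can be illustrated with a small sketch. It classifies a pair of intervals using the thirteen Allen-style interval relations; that relation set is an assumption made here for illustration, since the abstract does not state which relations ARMADA or the richer temporal association rules actually use. The function name and the `(start, end)` tuple representation are likewise hypothetical.

```python
def interval_relation(a, b):
    """Return the Allen-style relation of interval `a` to interval `b`.

    Each interval is a (start, end) tuple with start < end.
    """
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s1 < s2 and e1 > e2:
        return "contains"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2:
        return "overlaps"  # here s1 < s2 < e1 < e2
    # From here on s1 > s2 and e1 > e2.
    if s1 < e2:
        return "overlapped-by"
    if s1 == e2:
        return "met-by"
    return "after"


if __name__ == "__main__":
    fever = (1, 4)      # hypothetical events with (start, end) times
    treatment = (3, 6)
    print(interval_relation(fever, treatment))
```

Two point events can only be ordered as earlier, simultaneous or later, whereas two intervals fall into one of thirteen cases, which gives a sense of why interval-based rules can be semantically richer.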
138	Data Warehouse : An Outlook of Current Usage of External Data Olsson, Marcus January 2002 A data warehouse (DW) is a data collection that integrates large amounts of data from several sources, with the aim of supporting the decision-making process in a company. Data can be acquired from internal sources within the organization itself, as well as from external sources outside the organization. The overall aim of this dissertation is to examine the current usage of external data and its sources for integration into DWs, in order to give users of a DW the best possible foundation for decision-making. To investigate this problem, we conducted an interview study with DW developers. The results of the interview study show that it is relatively common to integrate external data into DWs. The study also identifies the different types of external data that are integrated and the external sources from which data are commonly acquired. In addition, opportunities and pitfalls of integrating external data are highlighted. Data warehouse External data Computer and systems science Data- och systemvetenskap
139	Utvärdering av riktlinjer för inkorporering av syndikat data i datalager : Praktikfältets syn på tillämpbarhet och nyttoeffekt av Strands riktlinjer för inkorporering av syndikat data i datalager. / Evaluation of guidelines for incorporating syndicate data into data warehouses: practitioners' view of the applicability and benefits of Strand's guidelines for incorporating syndicate data into data warehouses. Helander, Magnus January 2005 Incorporating external data into data warehouses is problematic, and the difficulties are confirmed by recent studies in the area. This has led to the development of various forms of support for addressing and analysing the problems that organizations face. For organizations it is highly important that their decision-makers are well informed and able to select information from large amounts of data. It is in this context that a data warehouse solution is an important cornerstone for supporting the analysis and presentation of data originally stored in different data sources (both internal and external). By incorporating external data, the data warehouse attains considerably higher potential, and organizations, above all their decision-makers, can thereby gain major benefits. Strand (2005) has developed guidelines to support the process of incorporating external data into data warehouses. However, an evaluation of the guidelines is lacking. An evaluation helps to strengthen the credibility of the guidelines and brings them into a maintenance process at an early stage. Data warehouse Extern data Syndikat data Computer science Datavetenskap
140	Personal Information Environment: A Framework for Managing Personal Files across a Set of Devices MOHAMMAD, ATIF 06 August 2009 Advances in computing over the last three decades have introduced many devices into our daily lives, including personal computers, laptops, cellular devices and many more. The data we need for our processing tasks is scattered among these devices. A Personal Information Environment makes all the data scattered across the devices used by an individual user available as a single whole. Data recharging is a technique that uses data replication to achieve a Personal Information Environment for an individual user. In this thesis, we propose a data recharging scheme for an individual user’s Personal Information Environment. We study the availability of data to a user by conducting a simulation using the data recharging algorithm. The data recharging approach is implemented using a master-slave data replication technique. / Thesis (Master, Computing) -- Queen's University, 2009-08-06 00:18:00.19 Personal Information Environment Data Recharging Data Replication Data Transmission
