951

Privacy preserving data anonymisation: an experimental examination of customer data for POPI compliance in South Africa

Chetty, Nirvashnee January 2020 (has links)
Data has become an essential commodity in this day and age. Organisations want to share the massive amounts of data that they collect as a way to leverage and grow their businesses. On the other hand, the need to maintain privacy is critical in order to avoid the release of sensitive information. This has been shown to be a constant challenge, namely the trade-off between preserving privacy and data utility [1]. This study evaluates privacy models together with their relevant tools and techniques to ascertain whether data can be anonymised in such a way that it complies with the Protection of Personal Information (POPI) Act and preserves the privacy of individuals. The results of this research should provide a practical solution for organisations in South Africa to adequately anonymise customer data, ensuring POPI Act compliance with the use of a software tool. An experimental environment was set up with the ARX de-identification tool as the tool of choice to implement the privacy models. Two privacy models, namely k-anonymity and l-diversity, were tested on a publicly available data set. Data quality models as well as privacy risk measures were implemented. The results of the study showed that when taking both data utility and privacy risks into consideration, neither privacy model was the clear winner. The k-anonymity privacy model was a better choice for data utility, whereas the l-diversity privacy model was a better choice for privacy preservation by reducing re-identification risks. Therefore, in relation to the aim of the study, which was to compare the results of data anonymisation when data privacy needs outweigh data utility, the results showed that the l-diversity privacy model was the preferred model. Finally, considering that the POPI Act still awaits the final step of promulgation, there is time to conduct further experiments on the various ways to practically implement and apply data anonymisation techniques in the day-to-day processing of data and information in South Africa.
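As a minimal illustration of the k-anonymity property the study tests (a generic sketch, not the ARX tool's API): a table is k-anonymous with respect to a set of quasi-identifiers if every combination of quasi-identifier values occurs at least k times. The column names and records below are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical customer records with generalised quasi-identifiers
records = [
    {"age_band": "30-39", "postcode": "40**", "income": 52000},
    {"age_band": "30-39", "postcode": "40**", "income": 61000},
    {"age_band": "40-49", "postcode": "41**", "income": 48000},
    {"age_band": "40-49", "postcode": "41**", "income": 75000},
]
print(is_k_anonymous(records, ["age_band", "postcode"], k=2))  # True
```

Generalising values (age bands, masked postcodes) until this check passes is the basic trade-off the abstract describes: coarser values raise privacy but lower data utility.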
952

Ant colony optimization approach for stacking configurations

CHEN, Yijun 01 January 2011 (has links)
In data mining, classifiers are generated to predict the class labels of instances. An ensemble is a decision-making system which applies certain strategies to combine the predictions of different classifiers and generate a collective decision. Previous research has empirically and theoretically demonstrated that an ensemble classifier can be more accurate and stable than its component classifiers in most cases. Stacking is a well-known ensemble which adopts a two-level structure: the base-level classifiers generate predictions and the meta-level classifier makes collective decisions. A consequential problem is: what learning algorithms should be used to generate the base-level and meta-level classifiers in the Stacking configuration? It is not easy to find a suitable configuration for a specific dataset. In some early works, the selection of a meta classifier and its training data were the major concern. Recently, researchers have tried to apply metaheuristic methods to optimize the configuration of the base classifiers and the meta classifier. Ant Colony Optimization (ACO), which is inspired by the foraging behaviors of real ant colonies, is one of the most popular of these metaheuristics. In this work, we propose a novel ACO-Stacking approach that uses ACO to tackle the Stacking configuration problem. This work is the first to apply ACO to the Stacking configuration problem. Different implementations of the ACO-Stacking approach are developed. The first version identifies the appropriate learning algorithms for generating the base-level classifiers while using a specific algorithm to create the meta-level classifier. The second version simultaneously finds the suitable learning algorithms to create the base-level classifiers and the meta-level classifier. Moreover, we study how different kinds of local information about classifiers affect the classification results. Several pieces of local information collected from the initial phase of ACO-Stacking are considered, such as the precision and F-measure of each classifier and the correlative differences of paired classifiers. A series of experiments is performed to compare the ACO-Stacking approach with other ensembles on a number of datasets of different domains and sizes. The experiments show that the new approach can achieve promising results and gain advantages over other ensembles. The correlative differences of the classifiers could be the best local information in this approach. Under the agile ACO-Stacking framework, an application to a direct marketing problem is explored. A real-world database from a US-based catalog company, containing more than 100,000 customer marketing records, is used in the experiments. The results indicate that our approach can gain higher cumulative response lifts and cumulative profit lifts in the top deciles. In conclusion, it is competitive with some well-known conventional and ensemble data mining methods.
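The two-level Stacking structure that the thesis optimises can be sketched with scikit-learn (the ACO search itself is not shown; the particular base learners below are an arbitrary candidate configuration of the kind an ant might select, not the thesis's chosen one):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Base-level classifiers: one candidate configuration in the search space
base = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Meta-level classifier combines the base-level predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())
```

An ACO search would treat each such configuration as a path through the space of learning algorithms, score it by cross-validated accuracy, and deposit pheromone on well-performing choices.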
953

Talk on Big Data applications in the market

Díaz Huiza, César, Quezada Balcázar, César 12 September 2019 (has links)
César Díaz Huiza (DMC Perú) / César Quezada Balcázar (DMC Perú) / The talk covers the evolution and importance of Big Data in the market and its impact on the economy.
954

A Pattern Oriented Data Structure for Interactive Computer Music

Lockhart, Adam 05 1900 (has links)
This essay describes a pattern-oriented data structure, or PODS, a system for storing computer music data. It organizes input by sequences or patterns that recur, while extensively interlinking the data. The interlinking process emulates cognitive models, while the pattern processing draws specifically from music cognition. The project aims to create open-source external objects for the Max/MSP software environment. The computer code for this project is written in C and Objective-C.
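A minimal sketch of the core idea (in Python for readability, though the project itself is in C/Objective-C; all names are hypothetical): buffer an incoming event stream, index each fixed-length subsequence, and link every recurrence of a pattern back to the positions where it occurred.

```python
from collections import defaultdict

class PatternStore:
    """Stores fixed-length subsequences of a stream and interlinks recurrences."""

    def __init__(self, length=3):
        self.length = length
        self.buffer = []
        self.occurrences = defaultdict(list)  # pattern -> positions where it occurred

    def feed(self, event):
        """Add one event (e.g., a MIDI pitch) and index the newest pattern."""
        self.buffer.append(event)
        if len(self.buffer) >= self.length:
            start = len(self.buffer) - self.length
            pattern = tuple(self.buffer[start:])
            self.occurrences[pattern].append(start)

    def recurring(self):
        """Patterns seen more than once, linked to all their positions."""
        return {p: pos for p, pos in self.occurrences.items() if len(pos) > 1}

store = PatternStore(length=3)
for pitch in [60, 62, 64, 60, 62, 64, 67]:
    store.feed(pitch)
print(store.recurring())  # {(60, 62, 64): [0, 3]}
```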
955

Understanding the performance of healthcare services: a data-driven complex systems modeling approach

Tao, Li 13 February 2014 (has links)
Healthcare is of critical importance in maintaining people's health and wellness. It has attracted policy makers, researchers, and practitioners around the world to find better ways to improve the performance of healthcare services. One of the key indicators for assessing that performance is to show how accessible and timely the services will be to specific groups of people in distinct geographic locations and in different seasons, which is commonly reflected in the so-called wait times of services. Wait times involve multiple related impact factors, called predictors, such as demographic characteristics, service capacities, and human behaviors. Some impact factors, especially individuals' behaviors, may have mutual interactions, which can lead to tempo-spatial patterns in wait times at a systems level. The goal of this thesis is to gain a systematic understanding of healthcare services by investigating the causes and corresponding dynamics of wait times. This thesis presents a data-driven complex systems modeling approach to investigating the causes of tempo-spatial patterns in wait times from a self-organizing perspective. As the predictors of wait times may have direct, indirect, and/or moderating effects, referred to as complex effects, a Structural Equation Modeling (SEM)-based analysis method is proposed to discover the complex effects from aggregated data. Existing regression-based analysis techniques are only able to reveal pairwise relationships between observed variables, whereas this method allows us to explore the complex effects of observed and/or unobserved (latent) predictors on wait times simultaneously. This thesis then considers how to estimate the variations in wait times with respect to changes in specific predictors and their revealed complex effects. An integrated projection method using the SEM-based analysis, projection, and a queuing model analysis is developed. Unlike existing studies that either make projections based primarily on pairwise relationships between variables, or queuing model-based discrete event simulations, the proposed method enables us to make a more comprehensive estimate by taking into account the complex effects exerted by multiple observed and latent predictors, and thus gain insights into the variations in the estimated wait times over time. This thesis further presents a method for designing and evaluating service management strategies to improve wait times, which are determined by service management behaviors. Our proposed strategy for allocating time blocks in operating rooms (ORs) incorporates historical feedback information about ORs and can adapt to the unpredictable changes in patient arrivals and hence shorten wait times. Existing time block allocations are somewhat ad hoc and are based primarily on the allocations in previous years, and thus result in inefficient use of service resources. Finally, this thesis proposes a behavior-based autonomy-oriented modeling method for modeling and characterizing the emergent tempo-spatial patterns at a systems level by taking into account the underlying individuals' behaviors with respect to various impact factors. This method uses multi-agent Autonomy-Oriented Computing (AOC), a computational modeling and problem-solving paradigm with a special focus on addressing the issues of self-organization and interactivity, to model heterogeneous individuals (entities), autonomous behaviors, and the mutual interactions between entities and certain impact factors.
The proposed method therefore eliminates to a large extent the strong assumptions that are used to define the stochastic properties of patient arrivals and services in stochastic modeling methods (e.g., the queuing model and discrete event simulation), and those of fixed relationships between entities that are held by system dynamics methods. The method is also more practical than agent-based modeling (ABM) for discovering the underlying mechanisms for emergent patterns, as AOC provides a general principle for explicitly stating what fundamental behaviors of and interactions between entities should be modeled. To demonstrate the effectiveness of the proposed systematic approach to understanding the dynamics and relevant patterns of wait times in specific healthcare service systems, we conduct a series of studies focusing on the cardiac care services in Ontario, Canada. Based on aggregated data that describe the services from 2004 to 2007, we use the SEM-based analysis method to (1) investigate the direct and moderating effects that specific demand factors, in terms of certain geodemographic profiles, exert on patient arrivals, which indirectly affect wait times; and (2) examine the effects of these factors (e.g., patient arrivals, physician supply, OR capacity, and wait times) on the wait times in subsequent units in a hospital. We present the effectiveness of integrated projection in estimating the regional changes in service utilization and wait times in cardiac surgery services in 2010-2011. We propose an adaptive OR time block allocation strategy and evaluate its performance based on a queuing model derived from the general perioperative practice. Finally, we demonstrate how to use the behavior-based autonomy-oriented modeling method to model and simulate the cardiac care system. We find that patients' hospital selection behavior, hospitals' service adjusting behavior, and their interactions via wait times may account for the emergent tempo-spatial patterns that are observed in the real-world cardiac care system. In summary, this thesis emphasizes the development of a data-driven complex systems modeling approach for understanding wait time dynamics in a healthcare service system. This approach will provide policy makers, researchers, and practitioners with a practically useful method for estimating the changes in wait times in various "what-if" scenarios, and will support the design and evaluation of resource allocation strategies for better wait times management. By addressing the problem of characterizing emergent tempo-spatial wait time patterns in the cardiac care system from a self-organizing perspective, we have provided a potentially effective means for investigating various self-organized patterns in complex healthcare systems. Keywords: Complex Healthcare Service Systems, Wait Times, Data-Driven Complex Systems Modeling, Autonomy-Oriented Computing (AOC), Cardiac Care
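As a toy illustration of the kind of queuing-model analysis the thesis builds on (not the authors' model): the expected wait in an M/M/c queue, where c parallel servers stand in for operating rooms, follows from the Erlang C formula. The arrival and service rates below are invented.

```python
from math import factorial

def erlang_c_wait(arrival_rate, service_rate, servers):
    """Mean wait (same time units as the rates) in an M/M/c queue, via Erlang C."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # server utilisation, must be < 1
    if rho >= 1:
        raise ValueError("queue is unstable: utilisation >= 1")
    erlang_term = a**servers / factorial(servers)
    p_wait = erlang_term / ((1 - rho) * sum(a**k / factorial(k)
                                            for k in range(servers)) + erlang_term)
    return p_wait / (servers * service_rate - arrival_rate)

# Hypothetical: 4 ORs, 0.5 cases/hour arriving, each OR completing 0.15 cases/hour
print(f"{erlang_c_wait(0.5, 0.15, 4):.1f} hours average wait")
```

Such closed-form models require the strong stochastic assumptions (Poisson arrivals, exponential services) that the thesis's behavior-based AOC method aims to relax.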
956

System Support for Large-scale Geospatial Data Analytics

January 2020 (has links)
abstract: The volume of available spatial data has increased tremendously. Such data includes, but is not limited to: weather maps, socioeconomic data, vegetation indices, geotagged social media, and more. These applications need a powerful data management platform to support scalable and interactive analytics on big spatial data. Even though existing single-node spatial database management systems (DBMSs) provide support for spatial data, they suffer from performance issues when dealing with big spatial data. The challenges in building large-scale spatial data systems are as follows: (1) System scalability: the massive scale of available spatial data hinders making sense of it using traditional spatial database management systems. Moreover, large-scale spatial data, besides its tremendous storage footprint, may be extremely difficult to manage and maintain due to heterogeneous shapes, skewed data distributions and complex spatial relationships. (2) Fast analytics: when the user runs spatial data analytics applications using graphical analytics tools, she does not tolerate delays introduced by the underlying spatial database system. Instead, the user needs to see useful information quickly. In this dissertation, I focus on designing efficient data systems and data indexing mechanisms to bolster scalable and interactive analytics on large-scale geospatial data. I first propose a cluster computing system, GeoSpark, which extends the core engine of Apache Spark and Spark SQL to support spatial data types, indexes, and geometrical operations at scale. In order to reduce the indexing overhead, I propose Hippo, a fast, yet scalable, sparse database indexing approach. In contrast to existing tree index structures, Hippo stores disk page ranges (each works as a pointer to one or many pages) instead of tuple pointers in the indexed table, to reduce the storage space occupied by the index. Moreover, I present Tabula, a middleware framework that sits between a SQL data system and a spatial visualization dashboard to make the user experience with the dashboard more seamless and interactive. Tabula adopts a materialized sampling cube approach, which pre-materializes samples, not for the entire table as in the SampleFirst approach, but for the results of potentially unforeseen queries (represented by an OLAP cube cell). / Dissertation/Thesis / Doctoral Dissertation Computer Science 2020
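A minimal sketch of the sparse-indexing idea behind Hippo (an illustration of the principle, not its actual implementation): rather than one pointer per tuple, keep one small summary per run of disk pages, and scan only the runs whose summary might contain the lookup value.

```python
class PageRangeIndex:
    """Sparse index: one (min, max) summary per fixed-size run of pages."""

    def __init__(self, pages, pages_per_range=2):
        self.pages = pages
        self.ranges = []
        for i in range(0, len(pages), pages_per_range):
            run = [v for page in pages[i:i + pages_per_range] for v in page]
            self.ranges.append((i, i + pages_per_range, min(run), max(run)))

    def lookup(self, value):
        """Scan only page ranges whose [min, max] summary could contain value."""
        hits = []
        for start, end, lo, hi in self.ranges:
            if lo <= value <= hi:
                for page in self.pages[start:end]:
                    hits.extend(v for v in page if v == value)
        return hits

pages = [[3, 7], [5, 9], [40, 42], [41, 44]]  # hypothetical disk pages of values
idx = PageRangeIndex(pages)
print(idx.lookup(41))  # scans only the second page range -> [41]
```

The index stores one entry per page range instead of one per tuple, which is the storage saving the abstract describes, at the cost of scanning whole candidate ranges.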
957

An APIfication Approach to Facilitate the Access and Reuse of Open Data

González Mora, César 10 September 2021 (has links)
Nowadays, there is a tendency to publish data on the Web, due to the benefits it brings to society and the new legislation that encourages the opening of data. These collections of open data, also known as datasets, are typically published in open data portals by governments and institutions around the world in order to make them open -- available on the Web in a free and reusable manner. The common behaviour is that publishers expose their data as individual tabular datasets. Open data is considered highly valuable because promoting the use of public information produces transparency, innovation and other social, political and economic benefits. This importance is especially considerable in situational scenarios, where a small group of consumers (developers or data scientists) with specific needs require thematic data for a short life cycle. Different mechanisms exist so that these data consumers can easily assess whether the data is adequate for their purpose. For example, SPARQL endpoints have become very useful for the consumption of open data, and particularly Linked Open Data (LOD). Moreover, Web Application Programming Interfaces (APIs), which give straightforward access to open data, are also a highly recommended feature of open data portals. However, accessing this open data is a difficult task since current open data platforms do not generally provide suitable strategies to access their data. On the one hand, accessing open data through SPARQL endpoints is difficult because it requires knowledge of different technologies, which is challenging especially for novice developers. Moreover, LOD is not usually available, since the most used formats in government open data portals are tabular. On the other hand, although providing Web APIs would make it easy for developers to access and reuse open data from the catalogs of open data portals, suitable Web APIs are lacking in open data portals. Moreover, in most cases the currently available APIs only allow access to a catalog's metadata or to download entire data resources (i.e., coarse-grained access to data), hampering the reuse of data. In addition, as open data is commonly published individually without considering potential relationships with other datasets, reusing several open datasets together is not a trivial task, thus requiring mechanisms that allow data consumers to integrate and access tabular open data published on the Web. Therefore, open data is not being used to its full potential because it is not easily accessible. As access to open data is thus still limited for end-users, particularly those without programming skills, we propose a model-based approach to automatically generate Web APIs from open data. This APIfication approach takes into account access to multiple integrated tabular datasets and the consumption of data in situational scenarios. Firstly, we focus on data that can be integrated by means of join and union operations. Then, we coin the term disposable Web APIs as an alternative mechanism for the consumption of open data in situational scenarios. These disposable Web APIs are created on the fly to be used temporarily by a user to consume specific open data. Accordingly, the main objective is to provide suitable mechanisms to easily access and reuse open data on the fly and in an integrated manner, solving the problem of difficult access through SPARQL endpoints for most data consumers and the lack of suitable Web APIs with easy access to open data.
With this approach, we address both open data publishers and consumers: publishers will be able to include a Web API within their data, and data consumers or reusers will benefit in those cases where a Web API pointing to the open data is missing. The results of the experiments conducted led us to conclude that users consider our generated Web APIs easy to use and as providing the desired open data, even when it comes from different datasets and especially in situational scenarios. / This thesis was funded by the Universidad de Alicante through a predoctoral training contract, and by the Generalitat Valenciana through a grant for the recruitment of predoctoral research staff (ACIF2019).
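A toy sketch of the disposable Web API idea (hand-written here for illustration; the thesis generates such APIs automatically from a model): a temporary, read-only API over a tabular open dataset, offering fine-grained row filtering rather than a whole-file download. Flask and pandas are assumed; the dataset and field names are invented.

```python
import pandas as pd
from flask import Flask, jsonify, request

# Hypothetical tabular open dataset, as downloaded from a portal
df = pd.DataFrame([
    {"city": "Alicante", "year": 2020, "air_quality": 42},
    {"city": "Valencia", "year": 2020, "air_quality": 55},
])

app = Flask(__name__)

@app.route("/api/records")
def records():
    """Fine-grained access: filter rows by any column via query parameters."""
    result = df
    for column, value in request.args.items():
        if column in result.columns:
            result = result[result[column].astype(str) == value]
    return jsonify(result.to_dict(orient="records"))

if __name__ == "__main__":
    app.run()  # e.g. GET /api/records?city=Alicante
```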
958

Mining for Significant Information from Unstructured and Structured Biological Data and Its Applications

Al-Azzam, Omar Ghazi January 2012 (has links)
Massive amounts of biological data are being accumulated in science. Searching for significant, meaningful information and patterns in different types of data is necessary for gaining knowledge from the large amounts of data available to users. However, data mining techniques do not normally deal with significance. Integrating data mining techniques with standard statistical procedures provides a way of mining statistically significant, interesting information from both structured and unstructured data. In this dissertation, different algorithms for mining significant biological information from both unstructured and structured data are proposed. A weighted-density-based approach is presented for mining item data from unstructured textual representations. Different algorithms in the area of radiation hybrid mapping are developed for mining significant information from structured binary data. The proposed algorithms have different applications to the ordering problem in radiation hybrid mapping, including identifying unreliable markers and building solid framework maps. The effectiveness of the proposed algorithms in improving map stability is demonstrated. Map stability is determined based on resampling analysis. The proposed algorithms deal effectively and efficiently with multidimensional data and also reduce computational cost dramatically. Evaluation shows that the proposed algorithms outperform comparative methods in terms of both accuracy and computation cost.
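The resampling analysis used to judge stability can be illustrated with a generic bootstrap sketch (invented data; not the dissertation's mapping algorithms): resample the observations with replacement, recompute a statistic each time, and read its spread as a stability measure.

```python
import random

def bootstrap_stability(data, statistic, n_resamples=1000, seed=0):
    """Resample with replacement and report mean and spread of a statistic."""
    rng = random.Random(seed)
    values = [statistic([rng.choice(data) for _ in data])
              for _ in range(n_resamples)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var ** 0.5

# Hypothetical marker retention frequencies from a radiation hybrid panel
retention = [0.18, 0.22, 0.25, 0.21, 0.19, 0.30, 0.24]
mean, std = bootstrap_stability(retention, lambda xs: sum(xs) / len(xs))
print(f"bootstrap mean {mean:.3f} +/- {std:.3f}")
```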
959

Statistical analysis of grouped data

Crafford, Gretel 01 July 2008 (has links)
The maximum likelihood (ML) estimation procedure of Matthews and Crowther (1995: A maximum likelihood estimation procedure when modelling in terms of constraints. South African Statistical Journal, 29, 29-51) is utilized to fit a continuous distribution to a grouped data set. This grouped data set may be a single frequency distribution or various frequency distributions that arise from a cross-classification of several factors in a multifactor design. It is also shown how to fit a bivariate normal distribution to a two-way contingency table where the two underlying continuous variables are jointly normally distributed. This thesis is organized in three parts, each playing a vital role in the explanation of analysing grouped data with the ML estimation of Matthews and Crowther. In Part I the ML estimation procedure of Matthews and Crowther is formulated. This procedure plays an integral role and is implemented in all three parts of the thesis. In Part I the exponential distribution is fitted to a grouped data set to explain the technique. Two different formulations of the constraints are employed in the ML estimation procedure and provide identical results. The justification of the method is further motivated by a simulation study. Similar to the exponential distribution, the estimation of the normal distribution is also explained in detail. Part I is summarized in Chapter 5, where a general method is outlined for fitting continuous distributions to a grouped data set. Distributions such as the Weibull, the log-logistic and the Pareto distributions can be fitted very effectively by formulating the vector of constraints in terms of a linear model. In Part II it is explained how to model a grouped response variable in a multifactor design. This multifactor design arises from a cross-classification of the various factors or independent variables to be analysed. The cross-classification of the factors results in a total of T cells, each containing a frequency distribution. Distribution fitting is done simultaneously in each of the T cells of the multifactor design. Distribution fitting is also done under the additional constraints that the parameters of the underlying continuous distributions satisfy a certain structure or design. The effect of the factors on the grouped response variable may be evaluated from this fitted design. Applications of a single-factor and a two-factor model are considered to demonstrate the versatility of the technique. A two-way contingency table where the two variables have an underlying bivariate normal distribution is considered in Part III. The estimation of the bivariate normal distribution reveals the complete underlying continuous structure between the two variables. The ML estimate of the correlation coefficient ρ is used to great effect to describe the relationship between the two variables. Apart from an application, a simulation study is also provided to support the proposed method. / Thesis (PhD (Mathematical Statistics))--University of Pretoria, 2007. / Statistics / unrestricted
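A generic illustration of the underlying task, fitting an exponential distribution to grouped frequencies by maximum likelihood (a plain numerical optimisation, not the Matthews-Crowther constrained procedure itself; the bin counts are invented): each bin's probability is the difference of the fitted CDF at its edges, and the rate maximising the multinomial log-likelihood is the ML estimate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical grouped data: bin edges and observed frequencies
edges = np.array([0.0, 1.0, 2.0, 3.0, 5.0, np.inf])
freq = np.array([120, 65, 38, 30, 12])

def neg_log_likelihood(rate):
    """Negative multinomial log-likelihood of exponential(rate) over the bins."""
    cdf = 1.0 - np.exp(-rate * edges)   # exp(-inf) -> 0, so the last bin closes at 1
    p = np.diff(cdf)                    # probability mass of each bin
    return -np.sum(freq * np.log(p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(f"ML estimate of the exponential rate: {res.x:.3f}")
```

The thesis's procedure reaches the same kind of estimate by expressing the fit as constraints on the cell frequencies, which is what lets it extend to multifactor designs and bivariate tables.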
960

Scalable Dynamic Big Data Geovisualization With Spatial Data Structure

Siqi Gu (8779961) 29 April 2020 (has links)
Compared to traditional cartography, big data geographic information processing is not a simple task at all; it requires special methods. When existing geovisualization systems face millions of data points, the zoom function and the dynamic data-adding function usually cannot be satisfied at the same time. This research classifies the existing methods of geovisualization, analyzes their functions and bottlenecks, analyzes their applicability in a big data environment, and proposes a method that combines a spatial data structure with on-demand iterative calculation. It also shows that this method can effectively balance the performance of zooming and of adding new data, and that it is significantly better than existing libraries in the time consumed by adding new data and zooming.
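A minimal sketch of the kind of spatial data structure such a method can build on (entirely illustrative; not the thesis's implementation): a quadtree that keeps an aggregate count in every node, so a zoomed-out view reads counts from coarse nodes instead of iterating over millions of points, while new points are added by a single root-to-leaf descent.

```python
class QuadTree:
    """Counts points per quadrant so coarse zoom levels need no full scan."""

    def __init__(self, x0, y0, x1, y1, depth=0, max_depth=6):
        self.bounds = (x0, y0, x1, y1)
        self.count = 0
        self.children = None
        self.depth, self.max_depth = depth, max_depth

    def insert(self, x, y):
        self.count += 1                      # aggregate count at every level
        if self.depth == self.max_depth:
            return
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        if self.children is None:
            self.children = [QuadTree(*b, self.depth + 1, self.max_depth)
                             for b in [(x0, y0, mx, my), (mx, y0, x1, my),
                                       (x0, my, mx, y1), (mx, my, x1, y1)]]
        index = (x >= mx) + 2 * (y >= my)    # pick the child quadrant
        self.children[index].insert(x, y)

tree = QuadTree(0, 0, 100, 100)
for px, py in [(10, 10), (12, 15), (80, 85)]:
    tree.insert(px, py)
print(tree.count, [c.count for c in tree.children])  # 3 [2, 0, 0, 1]
```

Because each insertion only touches one path of nodes, adding data stays cheap, and zooming reads pre-aggregated counts at the matching depth: the balance between the two operations that the research targets.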
