Global ETD Search

1	Dataset selection for aggregate model implementation in predictive data mining Lutu, P.E.N. (Patricia Elizabeth Nalwoga) 15 November 2010 (has links) Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms: See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling, proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The experiments of feature selection revealed that the use of many samples, robust measures of correlation, and validation procedures result in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models which aren combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed the aggregate models provide a high level of predictive accuracy compared to single models. Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results. / Thesis (PhD)--University of Pretoria, 2010. / Computer Science / unrestricted Dataset partitioning Data mining Bias reduction Predictive modeling Classification Model aggregation Ensemble classifiers Ova classification Pvn classification Dataset selection Featureselection Variable selection Large datasets Variance reduction Dataset sampling UCTD
2	PREVENTING DATA POISONING ATTACKS IN FEDERATED MACHINE LEARNING BY AN ENCRYPTED VERIFICATION KEY Mahdee, Jodayree 06 1900 (has links) Federated learning has gained attention recently for its ability to protect data privacy and distribute computing loads [1]. It overcomes the limitations of traditional machine learning algorithms by allowing computers to train on remote data inputs and build models while keeping participant privacy intact. Traditional machine learning offered a solution by enabling computers to learn patterns and make decisions from data without explicit programming. It opened up new possibilities for automating tasks, recognizing patterns, and making predictions. With the exponential growth of data and advances in computational power, machine learning has become a powerful tool in various domains, driving innovations in fields such as image recognition, natural language processing, autonomous vehicles, and personalized recommendations. traditional machine learning, data is usually transferred to a central server, raising concerns about privacy and security. Centralizing data exposes sensitive information, making it vulnerable to breaches or unauthorized access. Centralized machine learning assumes that all data is available at a central location, which is only sometimes practical or feasible. Some data may be distributed across different locations, owned by different entities, or subject to legal or privacy restrictions. Training a global model in traditional machine learning involves frequent communication between the central server and participating devices. This communication overhead can be substantial, particularly when dealing with large-scale datasets or resource-constrained devices. / Recent studies have uncovered security issues with most of the federated learning models. One common false assumption in the federated learning model is that participants are the attacker and would not use polluted data. This vulnerability enables attackers to train their models using polluted data and then send the polluted updates to the training server for aggregation, potentially poisoning the overall model. In such a setting, it is challenging for an edge server to thoroughly inspect the data used for model training and supervise any edge device. This study evaluates the vulnerabilities present in federated learning and explores various types of attacks that can occur. This paper presents a robust prevention scheme to address these vulnerabilities. The proposed prevention scheme enables federated learning servers to monitor participants actively in real-time and identify infected individuals by introducing an encrypted verification scheme. The paper outlines the protocol design of this prevention scheme and presents experimental results that demonstrate its effectiveness. / Thesis / Doctor of Philosophy (PhD) / federated learning models face significant security challenges and can be vulnerable to attacks. For instance, federated learning models assume participants are not attackers and will not manipulate the data. However, in reality, attackers can compromise the data of remote participants by inserting fake or altering existing data, which can result in polluted training results being sent to the server. For instance, if the sample data is an animal image, attackers can modify it to contaminate the training data. This paper introduces a robust preventive approach to counter data pollution attacks in real-time. It incorporates an encrypted verification scheme into the federated learning model, preventing poisoning attacks without the need for specific attack detection programming. The main contribution of this paper is a mechanism for detection and prevention that allows the training server to supervise real-time training and stop data modifications in each client's storage before and between training rounds. The training server can identify real-time modifications and remove infected remote participants with this scheme. Federated Learning Data Poisoning Attacks Security Vulnerabilities Model Aggregation Edge Server Supervision Attack Evaluation Prevention Scheme Real-time Monitoring Encrypted Verification Protocol Design Experimental Results Participant Identification Robustness in FL Machine Learning Security Model Integrity

Search results

Dataset selection for aggregate model implementation in predictive data mining

PREVENTING DATA POISONING ATTACKS IN FEDERATED MACHINE LEARNING BY AN ENCRYPTED VERIFICATION KEY