Global ETD Search

561	Data Masking, Encryption, and their Effect on Classification Performance: Trade-offs Between Data Security and Utility Asenjo, Juan C. 01 January 2017 (has links) As data mining increasingly shapes organizational decision-making, the quality of its results must be questioned to ensure trust in the technology. Inaccuracies can mislead decision-makers and cause costly mistakes. With more data collected for analytical purposes, privacy is also a major concern. Data security policies and regulations are increasingly put in place to manage risks, but these policies and regulations often employ technologies that substitute and/or suppress sensitive details contained in the data sets being mined. Data masking and substitution and/or data encryption and suppression of sensitive attributes from data sets can limit access to important details. It is believed that the use of data masking and encryption can impact the quality of data mining results. This dissertation investigated and compared the causal effects of data masking and encryption on classification performance as a measure of the quality of knowledge discovery. A review of the literature found a gap in the body of knowledge, indicating that this problem had not been studied before in an experimental setting. The objective of this dissertation was to gain an understanding of the trade-offs between data security and utility in the field of analytics and data mining. The research used a nationally recognized cancer incidence database, to show how masking and encryption of potentially sensitive demographic attributes such as patients’ marital status, race/ethnicity, origin, and year of birth, could have a statistically significant impact on the patients’ predicted survival. Performance parameters measured by four different classifiers delivered sizable variations in the range of 9% to 10% between a control group, where the select attributes were untouched, and two experimental groups where the attributes were substituted or suppressed to simulate the effects of the data protection techniques. In practice, this represented a corroboration of the potential risk involved when basing medical treatment decisions using data mining applications where attributes in the data sets are masked or encrypted for patient privacy and security concerns. Big Data Data Analytics Data Mining Encryption Knowledge Discovery Masking Computer Engineering Computer Sciences
562	Unsupervised discovery of relations for analysis of textual data in digital forensics Louis, Anita Lily 23 August 2010 (has links) This dissertation addresses the problem of analysing digital data in digital forensics. It will be shown that text mining methods can be adapted and applied to digital forensics to aid analysts to more quickly, efficiently and accurately analyse data to reveal truly useful information. Investigators who wish to utilise digital evidence must examine and organise the data to piece together events and facts of a crime. The difficulty with finding relevant information quickly using the current tools and methods is that these tools rely very heavily on background knowledge for query terms and do not fully utilise the content of the data. A novel framework in which to perform evidence discovery is proposed in order to reduce the quantity of data to be analysed, aid the analysts' exploration of the data and enhance the intelligibility of the presentation of the data. The framework combines information extraction techniques with visual exploration techniques to provide a novel approach to performing evidence discovery, in the form of an evidence discovery system. By utilising unrestricted, unsupervised information extraction techniques, the investigator does not require input queries or keywords for searching, thus enabling the investigator to analyse portions of the data that may not have been identified by keyword searches. The evidence discovery system produces text graphs of the most important concepts and associations extracted from the full text to establish ties between the concepts and provide an overview and general representation of the text. Through an interactive visual interface the investigator can explore the data to identify suspects, events and the relations between suspects. Two models are proposed for performing the relation extraction process of the evidence discovery framework. The first model takes a statistical approach to discovering relations based on co-occurrences of complex concepts. The second model utilises a linguistic approach using named entity extraction and information extraction patterns. A preliminary study was performed to assess the usefulness of a text mining approach to digital forensics as against the traditional information retrieval approach. It was concluded that the novel approach to text analysis for evidence discovery presented in this dissertation is a viable and promising approach. The preliminary experiment showed that the results obtained from the evidence discovery system, using either of the relation extraction models, are sensible and useful. The approach advocated in this dissertation can therefore be successfully applied to the analysis of textual data for digital forensics Copyright / Dissertation (MSc)--University of Pretoria, 2010. / Computer Science / unrestricted Text analysis Text mining Information extraction Relation discovery Digital forensics UCTD
563	Link layer topology discovery in an uncooperative ethernet environment Delport, Johannes Petrus 27 August 2008 (has links) Knowledge of a network’s entities and the physical connections between them, a network’s physical topology, can be useful in a variety of network scenarios and applications. Administrators can use topology information for fault- finding, inventorying and network planning. Topology information can also be used during protocol and routing algorithm development, for performance prediction and as a basis for accurate network simulations. Specifically, from a network security perspective, threat detection, network monitoring, network access control and forensic investigations can benefit from accurate network topology information. The dynamic nature of large networks has led to the development of various automatic topology discovery techniques, but these techniques have mainly focused on cooperative network environments where network elements can be queried for topology related information. The primary objective of this study is to develop techniques for discovering the physical topology of an Ethernet network without the assistance of the network’s elements. This dissertation describes the experiments performed and the techniques developed in order to identify network nodes and the connections between these nodes. The product of the investigation was the formulation of an algorithm and heuristic that, in combination with measurement techniques, can be used for inferring the physical topology of a target network. / Dissertation (MSc)--University of Pretoria, 2008. / Computer Science / unrestricted Network management Ethernet Network security Network topology discovery Network mapping Network monitoring UCTD
564	Ranked Search on Data Graphs Varadarajan, Ramakrishna R. 10 March 2009 (has links) Graph-structured databases are widely prevalent, and the problem of effective search and retrieval from such graphs has been receiving much attention recently. For example, the Web can be naturally viewed as a graph. Likewise, a relational database can be viewed as a graph where tuples are modeled as vertices connected via foreign-key relationships. Keyword search querying has emerged as one of the most effective paradigms for information discovery, especially over HTML documents in the World Wide Web. One of the key advantages of keyword search querying is its simplicity – users do not have to learn a complex query language, and can issue queries without any prior knowledge about the structure of the underlying data. The purpose of this dissertation was to develop techniques for user-friendly, high quality and efficient searching of graph structured databases. Several ranked search methods on data graphs have been studied in the recent years. Given a top-k keyword search query on a graph and some ranking criteria, a keyword proximity search finds the top-k answers where each answer is a substructure of the graph containing all query keywords, which illustrates the relationship between the keyword present in the graph. We applied keyword proximity search on the web and the page graph of web documents to find top-k answers that satisfy user’s information need and increase user satisfaction. Another effective ranking mechanism applied on data graphs is the authority flow based ranking mechanism. Given a top-k keyword search query on a graph, an authority-flow based search finds the top-k answers where each answer is a node in the graph ranked according to its relevance and importance to the query. We developed techniques that improved the authority flow based search on data graphs by creating a framework to explain and reformulate them taking in to consideration user preferences and feedback. We also applied the proposed graph search techniques for Information Discovery over biological databases. Our algorithms were experimentally evaluated for performance and quality. The quality of our method was compared to current approaches by using user surveys. Graph Search Database Ranking Web Search Information Retrieval Document Summarization Information Discovery
565	Drug repositioning and indication discovery using description logics Croset, Samuel January 2014 (has links) Drug repositioning is the discovery of new indications for approved or failed drugs. This practice is commonly done within the drug discovery process in order to adjust or expand the application line of an active molecule. Nowadays, an increasing number of computational methodologies aim at predicting repositioning opportunities in an automated fashion. Some approaches rely on the direct physical interaction between molecules and protein targets (docking) and some methods consider more abstract descriptors, such as a gene expression signature, in order to characterise the potential pharmacological action of a drug (Chapter 1). On a fundamental level, repositioning opportunities exist because drugs perturb multiple biological entities, (on and off-targets) themselves involved in multiple biological processes. Therefore, a drug can play multiple roles or exhibit various mode of actions responsible for its pharmacology. The work done for my thesis aims at characterising these various modes and mechanisms of action for approved drugs, using a mathematical framework called description logics. In this regard, I first specify how living organisms can be compared to complex black box machines and how this analogy can help to capture biomedical knowledge using description logics (Chapter 2). Secondly, the theory is implemented in the Functional Therapeutic Chemical Classification System (FTC - https://www.ebi.ac.uk/chembl/ftc/), a resource defining over 20,000 new categories representing the modes and mechanisms of action of approved drugs. The FTC also indexes over 1,000 approved drugs, which have been classified into the mode of action categories using automated reasoning. The FTC is evaluated against a gold standard, the Anatomical Therapeutic Chemical Classification System (ATC), in order to characterise its quality and content (Chapter 3). Finally, from the information available in the FTC, a series of drug repositioning hypotheses were generated and made publicly available via a web application (https://www.ebi.ac.uk/chembl/research/ftc-hypotheses). A subset of the hypotheses related to the cardiovascular hypertension as well as for Alzheimer’s disease are further discussed in more details, as an example of an application (Chapter 4). The work performed illustrates how new valuable biomedical knowledge can be automatically generated by integrating and leveraging the content of publicly available resources using description logics and automated reasoning. The newly created classification (FTC) is a first attempt to formally and systematically characterise the function or role of approved drugs using the concept of mode of action. The open hypotheses derived from the resource are available to the community to analyse and design further experiments. 610.28
566	A strategy for a systematic approach to biomarker discovery validation : a study on lung cancer microarray data set Dol, Zulkifli January 2015 (has links) Cancer is a serious threat to human health and is now one of major causes of death worldwide. However, the complexity of the cancer makes the development of new and specific diagnostic tools particularly challenging. A number of different strategies have been developed for biomarker discovery in cancer using microarray data. The problem that typically needs to be addressed is the scale of the data sets; we simply do not have (or are likely to obtain) sufficient data for classical machine learning approaches for biomarker discovery to be properly validated. Obtaining a biomarker that is specific to a particular cancer is also very challenging. The initial promise that was held out for gene microarray work for the development of cancer biomarkers has not yet yielded the hoped for breakthroughs. This work discusses the construction of a strategy for a systematic approach to biomarker discovery validation using lung cancer gene expression microarray data based around non-small cell cancer and in patients which either stayed disease free after surgery (a five year window) or in which the disease progressed and re-occurred. As a means of assisting the validation purposes we have therefore looked at new methodologies for using existing biological knowledge to support machine learning biomarker discovery techniques. We employ text mining strategy using previously published literature for correlating biological concepts to a given phenotype. Pathway driven approaches through the use of Web Services and workflows, enabled the large-scale dataset to be analysed systematically. The results showed that it was possible, at least using this specific data set, to clearly differentiate between progressive disease and disease free patients using a set of biomarkers implicated in neuroendocrine signaling. A validation of the biomarkers identified was attempted in three separately published data sets. This analysis showed that although there was support for some of our findings in one of these data sets, this appeared to be a function of the close similarity in experimental design followed rather than through specific of the analysis method developed. 616.99
567	Design and Performance Evaluation of Service Discovery Protocols for Vehicular Networks Abrougui, Kaouther January 2011 (has links) Intelligent Transportation Systems (ITS) are gaining momentum among researchers. ITS encompasses several technologies, including wireless communications, sensor networks, data and voice communication, real-time driving assistant systems, etc. These states of the art technologies are expected to pave the way for a plethora of vehicular network applications. In fact, recently we have witnessed a growing interest in Vehicular Networks from both the research community and industry. Several potential applications of Vehicular Networks are envisioned such as road safety and security, traffic monitoring and driving comfort, just to mention a few. It is critical that the existence of convenience or driving comfort services do not negatively affect the performance of safety services. In essence, the dissemination of safety services or the discovery of convenience applications requires the communication among service providers and service requesters through constrained bandwidth resources. Therefore, service discovery techniques for vehicular networks must efficiently use the available common resources. In this thesis, we focus on the design of bandwidth-efficient and scalable service discovery protocols for Vehicular Networks. Three types of service discovery architectures are introduced: infrastructure-less, infrastructure-based, and hybrid architectures. Our proposed algorithms are network layer based where service discovery messages are integrated into the routing messages for a lightweight discovery. Moreover, our protocols use the channel diversity for efficient service discovery. We describe our algorithms and discuss their implementation. Finally, we present the main results of the extensive set of simulation experiments that have been used in order to evaluate their performance. Service discovery Vehicular networks Fault tolerance QoS Performance evaluation Algorithm Protocol design
568	Estimating the Local False Discovery Rate via a Bootstrap Solution to the Reference Class Problem: Application to Genetic Association Data Abbas Aghababazadeh, Farnoosh January 2015 (has links) Modern scientific technology such as microarrays, imaging devices, genome-wide association studies or social science surveys provide statisticians with hundreds or even thousands of tests to consider simultaneously. Testing many thousands of null hypotheses may increase the number of Type $I$ errors. In large-scale hypothesis testing, researchers can use different statistical techniques such as family-wise error rates, false discovery rates, permutation methods, local false discovery rate, where all available data usually should be analyzed together. In applications, the thousands of tests are related by a scientifically meaningful structure. Ignoring that structure can be misleading as it may increase the number of false positives and false negatives. As an example, in genome-wide association studies each test corresponds to a specific genetic marker. In such a case, the scientific structure for each genetic marker can be its minor allele frequency. In this research, the local false discovery rate as a relevant statistical approach is considered to analyze the thousands of tests together. We present a model for multiple hypothesis testing when the scientific structure of each test is incorporated as a co-variate. The purpose of this model is to incorporate the co-variate to improve the performance of testing procedures. The method we consider has different estimates depending on the tuning parameter. We would like to estimate the optimal value of that parameter by considering observed statistics. Thus, among those estimators, the one which minimizes the estimated errors due to bias and to variance is chosen by applying the bootstrap approach. Such an estimation method is called an adaptive reference class method. Under the combined reference class method, the effect of the co-variates is ignored and all null hypotheses should be analyzed together. In this research, under some assumptions for the co-variates and the prior probabilities, the proposed adaptive reference class method shows smaller error than the combined reference class method in estimating the local false discovery rate, when the number of tests gets large. We describe the adaptive reference class method to the coronary artery disease data, and we use simulation data to evaluate the performance of the estimator associated with the adaptive reference class method. Multiple Testing Local False Discovery Rtae Reference Class Bias-variance Trade Off Tuning Parameter Bootstrap Approach
569	MotifGP: DNA Motif Discovery Using Multiobjective Evolution Belmadani, Manuel January 2016 (has links) The motif discovery problem is becoming increasingly important for molecular biologists as new sequencing technologies are producing large amounts of data, at rates which are unprecedented. The solution space for DNA motifs is too large to search with naive methods, meaning there is a need for fast and accurate motif detection tools. We propose MotifGP, a multiobjective motif discovery tool evolving regular expressions that characterize overrepresented motifs in a given input dataset. This thesis describes and evaluates a multiobjective strongly typed genetic programming algorithm for the discovery of network expressions in DNA sequences. Using 13 realistic data sets, we compare the results of our tool, MotifGP, to that of DREME, a state-of-art program. MotifGP outperforms DREME when the motifs to be sought are long, and the specificity is distributed over the length of the motif. For shorter motifs, the performance of MotifGP compares favourably with the state-of-the-art method. Finally, we discuss the advantages of multi-objective optimization in the context of this specific motif discovery problem. Genetic Programming Multiobjective optimization Motif Discovery Evolutionary Computing Bioinformatics ChIP-seq
570	Analýza reálných dat produktové redakce Alza.cz pomocí metod DZD / Analysis of real data from Alza.cz product department using methods of KDD Válek, Martin January 2014 (has links) This thesis deals with data analysis using methods of knowledge discovery in databases. The goal is to select appropriate methods and tools for implementation of a specific project based on real data from Alza.cz product department. Data analysis is performed by using association rules and decision rules in the Lisp-Miner and decision trees in the RapidMiner. The methodology used is the CRISP-DM. The thesis is divided into three main sections. First section is focused on the theoretical summary of information about KDD. There are defined basic terms and described the types of tasks and methods of KDD. In the second section is introduced the methodology CRISP-DM. The practical part firstly introduces company Alza.cz and its goals for this task. Afterwards, the basic structure of the data and preparation for the next step (data mining) is described. In conclusion, the results are evaluated and the possibility of their use is outlined.

Search results