1

Automated identification of digital evidence across heterogeneous data resources

Mohammed, Hussam J. January 2018 (has links)
Digital forensics has become an increasingly important tool in the fight against cyber and computer-assisted crime. However, with an increasing range of technologies at people's disposal, investigators find themselves having to process and analyse many systems with large volumes of data (e.g., PCs, laptops, tablets, and smartphones) within a single case. Unfortunately, current digital forensic tools operate in an isolated manner, investigating systems and applications individually. The heterogeneity and volume of evidence place time constraints and a significant burden on investigators. Examples of heterogeneity include applications such as messaging (e.g., iMessenger, Viber, Snapchat, and WhatsApp), web browsers (e.g., Firefox and Google Chrome), and file systems (e.g., NTFS, FAT, and HFS). Being able to analyse and investigate evidence from across devices and applications in a universal and harmonised fashion would enable investigators to query all data at once. In addition, successfully prioritising evidence and reducing the volume of data to be analysed reduces the time taken and the cognitive load on the investigator. This thesis focuses on the examination and analysis phases of the digital investigation process. It explores the feasibility of dealing with big and heterogeneous data sources in order to correlate evidence across these sources in an automated way. A novel approach was therefore developed to solve the heterogeneity issues of big data using three algorithms: harmonisation, clustering, and automated identification of evidence (AIE). The harmonisation algorithm provides an automated framework for merging similar datasets by characterising similar metadata categories and then harmonising them into a single dataset. This overcomes the heterogeneity issues and makes examination and analysis easier, because evidential artefacts from across devices and applications can be investigated, and queried at once, on the basis of these shared categories. Based on the merged datasets, the clustering algorithm identifies the evidential files and isolates non-related files based on their metadata. The AIE algorithm then attempts to identify the cluster holding the largest number of evidential artefacts, searching on the basis of two methods: criminal profiling activities and information from the criminals themselves. Related clusters are then identified through timeline analysis and a search for artefacts associated with the files in the first cluster. A series of experiments using real-life forensic datasets was conducted to evaluate the algorithms across five different categories of datasets (i.e., messaging, graphical files, file system, internet history, and emails), each containing data from different applications across different devices. The results of the characterisation and harmonisation process show that the algorithm can merge all fields successfully, with the exception of some binary-based data found within the messaging datasets (contained within Viber and SMS). The error occurred because of a lack of information for the characterisation process to make a useful determination. However, on further analysis, it was found that the error had a minimal impact on the subsequent merged data. The results of the clustering process and the AIE algorithm showed that the two algorithms, working together, can identify more than 92% of the evidential files.
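To make the harmonisation step concrete, the following minimal Python sketch (not the thesis implementation; field names such as msg_time and from_user are hypothetical) shows how metadata exported by different messaging applications can be characterised into shared categories and merged into a single dataset, which can then be queried or clustered as a whole.

```python
# A minimal sketch of the harmonisation idea: metadata exported by different
# applications uses different field names, so each source's fields are mapped
# onto shared categories and the rows are merged into one dataset.
import pandas as pd

# Hypothetical per-application exports with heterogeneous field names.
whatsapp = pd.DataFrame({
    "msg_time": ["2018-01-05 10:02", "2018-01-05 10:07"],
    "sender":   ["alice", "bob"],
    "body_len": [42, 13],
})
viber = pd.DataFrame({
    "timestamp": ["2018-01-06 09:15"],
    "from_user": ["carol"],
    "length":    [88],
})

# Characterisation step: each source's schema mapped to common categories.
FIELD_MAP = {
    "msg_time": "timestamp", "timestamp": "timestamp",
    "sender": "actor", "from_user": "actor",
    "body_len": "size", "length": "size",
}

def harmonise(frames):
    """Rename heterogeneous columns to shared categories and concatenate."""
    return pd.concat([f.rename(columns=FIELD_MAP) for f in frames],
                     ignore_index=True)

unified = harmonise([whatsapp, viber])
print(unified)
# The merged table can now be queried at once, or clustered on its shared
# metadata categories to separate likely evidential artefacts.
```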
2

On evaluating similarity between heterogeneous data

POPOVICI, STEFANA A. 19 September 2008 (has links)
No description available.
3

Statistical methods for robust analysis of transcriptome data by integration of biological prior knowledge

Jeanmougin, Marine 16 November 2012 (has links)
Recent advances in Molecular Biology have led biologists toward high-throughput genomic studies. In particular, the investigation of the human transcriptome offers unprecedented opportunities for understanding cellular and disease mechanisms. In this PhD, we focus on providing robust statistical methods dedicated to the treatment and analysis of high-throughput transcriptome data. We discuss the differential analysis approaches available in the literature for identifying genes associated with a phenotype of interest and propose a comparison study. We provide practical recommendations on the appropriate method to be used, based on various simulation models and real datasets. With the eventual goal of overcoming the inherent instability of differential analysis strategies, we have developed an innovative approach called DiAMS (DIsease Associated Modules Selection). This method selects significant modules of genes rather than individual genes and integrates both transcriptome and protein-interaction data in a local-score strategy. We then focus on the development of a framework to infer gene regulatory networks by integrating a biologically informative prior over network structures using Gaussian graphical models. This approach offers the possibility of exploring the molecular relationships between genes, leading to the identification of altered regulations potentially involved in disease processes. Finally, we apply our statistical developments to the study of metastatic relapse in breast cancer.
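As a concrete illustration of the gene-level differential analysis whose instability motivates DiAMS, the following Python sketch (simulated data, not the thesis pipeline) applies a per-gene Welch t-test followed by Benjamini-Hochberg selection of a gene signature.

```python
# A minimal sketch of gene-signature selection by differential expression,
# on simulated data: Welch t-test per gene, then Benjamini-Hochberg selection.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n_samples = 1000, 40
labels = np.array([0] * 20 + [1] * 20)          # two phenotype groups
expr = rng.normal(size=(n_genes, n_samples))    # genes x samples
expr[:50, labels == 1] += 1.5                   # 50 truly differential genes

# Welch t-test per gene between the two groups.
t, p = stats.ttest_ind(expr[:, labels == 0], expr[:, labels == 1],
                       axis=1, equal_var=False)

# Benjamini-Hochberg procedure at a 5% false discovery rate.
alpha = 0.05
sorted_p = np.sort(p)
thresh = np.arange(1, n_genes + 1) / n_genes * alpha
passed = np.nonzero(sorted_p <= thresh)[0]
cutoff = sorted_p[passed[-1]] if passed.size else 0.0
signature = np.where(p <= cutoff)[0]
print(f"{signature.size} genes selected at a 5% FDR")
```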
4

Grid-based semantic integration of heterogeneous data resources : implementation on a HealthGrid

Naseer, Aisha January 2007 (has links)
The semantic integration of geographically distributed and heterogeneous data resources still remains a key challenge in Grid infrastructures. Today's mainstream Grid technologies hold the promise of meeting this challenge in a systematic manner, making data applications more scalable and manageable. The thesis conducts a thorough investigation of the problem, the state of the art, and the related technologies, and proposes an Architecture for Semantic Integration of Data Sources (ASIDS) addressing the semantic heterogeneity issue. It defines a simple mechanism for the interoperability of heterogeneous data sources in order to extract or discover information regardless of their different semantics. The constituent technologies of this architecture include the Globus Toolkit (GT4) and OGSA-DAI (Open Grid Services Architecture Data Access and Integration), alongside other web services technologies such as XML (Extensible Markup Language). To show this, the ASIDS architecture was implemented and tested in a realistic setting by building an exemplar application prototype on a HealthGrid (pilot implementation). The study followed an empirical research methodology and was informed by extensive literature surveys and a critical analysis of the relevant technologies and their synergies. The two literature reviews, together with the analysis of the technology background, have provided a good overview of the current Grid and HealthGrid landscape, produced some valuable taxonomies, explored new paths by integrating technologies, and, more importantly, illuminated the problem and guided the research process towards a promising solution. Yet the primary contribution of this research is an approach that uses contemporary Grid technologies for integrating heterogeneous data resources that have semantically different data fields (attributes). It has been practically demonstrated (using a prototype HealthGrid) that discovery in semantically integrated distributed data sources is feasible using mainstream Grid technologies, which have been shown to have some significant advantages over non-Grid based approaches.
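The interoperability mechanism can be illustrated with a minimal sketch, which is not the ASIDS/OGSA-DAI implementation: semantically different attribute names from two hypothetical hospital sources are resolved to shared concepts so that a single query runs across both.

```python
# A minimal illustration of resolving semantically different attribute names
# to a shared concept so one query can be run across heterogeneous sources.
# All source and field names here are hypothetical.
hospital_a = [{"pat_name": "J. Doe", "dob": "1980-04-12"}]
hospital_b = [{"patientFullName": "A. Smith", "birth_date": "1975-09-30"}]

# Per-source mapping from a shared concept to the local attribute name.
SCHEMA = {
    "hospital_a": {"name": "pat_name",        "birthdate": "dob"},
    "hospital_b": {"name": "patientFullName", "birthdate": "birth_date"},
}
SOURCES = {"hospital_a": hospital_a, "hospital_b": hospital_b}

def query(concept, value):
    """Find records across all sources where the shared concept matches."""
    hits = []
    for source, records in SOURCES.items():
        local_field = SCHEMA[source][concept]
        hits += [(source, r) for r in records if r.get(local_field) == value]
    return hits

print(query("name", "A. Smith"))
```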
5

Automatic Creation of Researcher’s Competence Profiles Based on Semantic Integration of Heterogeneous Data Sources

Khadgi, Vinaya, Wang, Tianyi January 2012 (has links)
Research journals and publications are a great source of knowledge, produced by the hard work of researchers. Several digital libraries maintain records of such publications so that the general public and other researchers can find and study previous work in the fields they are interested in. To make searching effective and easy, all of these digital libraries keep a database of publication metadata. These metadata records are generally well designed to hold the vital details of publications and articles, and so have the potential to give information about researchers, their research activities, and hence their competence profiles. This thesis is a study of, and search for, a method for building researchers' competence profiles based on the records of their publications in well-known digital libraries. Researchers publish with different publication houses, so, in order to build a complete profile, data from several of these heterogeneous digital library sources have to be integrated semantically. Several semantic technologies were studied in order to investigate the challenges of integrating the heterogeneous sources and modelling the researchers' competence profiles. An approach of on-demand profile creation was chosen, in which a user of the system enters basic name details of the researcher whose profile is to be created. In this thesis work, Design Science Research methodology was used as the research method and, to complement it with a working artifact, Scrum, an agile software development methodology, was used to develop a competence profile system as a proof of concept.
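A minimal Python sketch of the on-demand profile idea, with hypothetical records standing in for metadata retrieved from two digital libraries: the requested researcher's publications are merged and their keywords aggregated into a simple competence profile.

```python
# A minimal sketch: merge publication metadata from two hypothetical libraries
# (with differently named fields) and aggregate keywords into a profile.
from collections import Counter

library_x = [
    {"author": "J. Smith", "title": "Ontology matching at scale",
     "keywords": ["ontologies", "data integration"]},
]
library_y = [
    {"creator": "J. Smith", "title": "Linked data for libraries",
     "subjects": ["linked data", "ontologies"]},
]

def competence_profile(name):
    """Aggregate keyword counts for one researcher across both sources."""
    topics = Counter()
    for rec in library_x:
        if rec["author"] == name:
            topics.update(rec["keywords"])
    for rec in library_y:
        if rec["creator"] == name:
            topics.update(rec["subjects"])
    return topics.most_common()

print(competence_profile("J. Smith"))
```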
6

Flexible techniques for heterogeneous XML data retrieval

Sanz Blasco, Ismael 31 October 2007 (has links)
The progressive adoption of XML by new communities of users has motivated the appearance of applications that require the management of large and complex collections, which present a large amount of heterogeneity. Some relevant examples are present in the fields of bioinformatics, cultural heritage, ontology management and geographic information systems, where heterogeneity is not only reflected in the textual content of documents, but also in the presence of rich structures which cannot be properly accounted for using fixed schema definitions. Current approaches for dealing with heterogeneous XML data are, however, mainly focused at the content level, whereas at the structural level only a limited amount of heterogeneity is tolerated; for instance, weakening the parent-child relationship between nodes into the ancestor-descendant relationship. The main objective of this thesis is devising new approaches for querying heterogeneous XML collections. This general objective has several implications. First, a collection can present different levels of heterogeneity at different granularity levels; this fact has a significant impact on the selection of specific approaches for handling, indexing and querying the collection. Therefore, several metrics are proposed for evaluating the level of heterogeneity at different levels, based on information-theoretical considerations. These metrics can be employed for characterizing collections, and for clustering together those collections which present similar characteristics. Second, the high structural variability implies that query techniques based on exact tree matching, such as the standard XPath and XQuery languages, are not suitable for heterogeneous XML collections. As a consequence, approximate querying techniques based on similarity measures must be adopted. Within the thesis, we present a formal framework for the creation of similarity measures, which is based on a study of the literature showing that most approaches for approximate XML retrieval (i) are highly tailored to very specific problems and (ii) use similarity measures for ranking that can be expressed as ad-hoc combinations of a set of 'basic' measures. Some examples of these widely used measures are tf-idf for textual information and several variations of edit distances. Our approach wraps these basic measures into generic, parametrizable components that can be combined into complex measures by exploiting the composite pattern, commonly used in Software Engineering. This approach also allows us to seamlessly integrate highly specific measures, such as protein-oriented matching functions. Finally, these measures are employed for the approximate retrieval of data in a context of high structural heterogeneity, using a new approach based on the concepts of pattern and fragment. In our context, a pattern is a concise representation of the information needs of a user, and a fragment is a match of a pattern found in the database. A pattern consists of a set of tree-structured elements (basically an XML subtree that is intended to be found in the database), but with a flexible semantics that is strongly dependent on a particular similarity measure. For example, depending on a particular measure, the particular hierarchy of elements, or the ordering of siblings, may or may not be deemed relevant when searching for occurrences in the database. Fragment matching, as a query primitive, can deal with a much higher degree of flexibility than existing approaches.
In this thesis we provide exhaustive and top-k query algorithms. In the latter case, we adopt an approach that does not require the similarity measure to be monotonic, as all previous XML top-k algorithms (usually based on Fagin's algorithm) do. We also present two extensions which are important in practical settings: a specification for the integration of the aforementioned techniques into XQuery, and a clustering algorithm that is useful for managing complex result sets. All of the algorithms have been implemented as part of ArHeX, a toolkit for the development of multi-similarity XML applications, which supports fragment-based queries through an extension of the XQuery language, and includes graphical tools for designing similarity measures and querying collections. We have used ArHeX to demonstrate the effectiveness of our approach using both synthetic and real data sets, in the context of a biomedical research project.
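The composite-measure idea can be sketched as follows; the basic measures shown (name equality and tag-set Jaccard) are illustrative stand-ins, not ArHeX's actual measures. Each basic measure is wrapped behind a common interface, and composites built from them are themselves measures, so complex, parametrizable measures can be assembled by composition.

```python
# A minimal sketch of combining 'basic' similarity measures via the
# composite pattern. Measures and element fields here are illustrative.
from abc import ABC, abstractmethod

class Measure(ABC):
    @abstractmethod
    def score(self, a, b) -> float: ...

class NameSimilarity(Measure):
    def score(self, a, b):
        return 1.0 if a["name"] == b["name"] else 0.0

class TagJaccard(Measure):
    def score(self, a, b):
        ta, tb = set(a["tags"]), set(b["tags"])
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

class WeightedSum(Measure):
    """Composite node: combines child measures, itself usable as a child."""
    def __init__(self, children):
        self.children = children  # list of (weight, Measure) pairs
    def score(self, a, b):
        total = sum(w for w, _ in self.children)
        return sum(w * m.score(a, b) for w, m in self.children) / total

measure = WeightedSum([(0.4, NameSimilarity()), (0.6, TagJaccard())])
x = {"name": "protein", "tags": ["bio", "sequence"]}
y = {"name": "protein", "tags": ["bio", "structure"]}
print(measure.score(x, y))   # 0.4 * 1.0 + 0.6 * (1/3) = 0.6
```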
7

Retrieving information from heterogeneous freight data sources to answer natural language queries

Seedah, Dan Paapanyin Kofi 09 February 2015 (has links)
The ability to retrieve accurate information from databases without an extensive knowledge of the contents and organization of each database is extremely beneficial to the dissemination and utilization of freight data. The challenges, however, are: 1) correctly identifying only the relevant information and keywords from questions when dealing with multiple sentence structures, and 2) automatically retrieving, preprocessing, and understanding multiple data sources to determine the best answer to a user's query. Current named entity recognition systems have the ability to identify entities but require an annotated corpus for training, which does not currently exist in the field of transportation planning. A hybrid approach which combines multiple models to classify specific named entities was therefore proposed as an alternative. The retrieval and classification of freight-related keywords facilitated the process of finding which databases are capable of answering a question. Values in data dictionaries can be queried by mapping keywords to data element fields in various freight databases using ontologies. A number of challenges still arise as a result of different entities sharing the same names, the same entity having multiple names, and differences in classification systems. Dealing with these ambiguities is required to accurately determine which database provides the best answer from the list of applicable sources. This dissertation 1) develops an approach to identifying and classifying keywords from freight-related natural language queries, 2) develops a standardized knowledge representation of freight data sources using an ontology that both computer systems and domain experts can utilize to identify relevant freight data sources, and 3) provides recommendations for addressing ambiguities in freight-related named entities. Finally, the use of knowledge-based expert systems to intelligently sift through data sources and determine which ones provide the best answer to a user's question is proposed.
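A minimal sketch of the routing step described above, assuming a hypothetical keyword lexicon and data source names: freight-related keywords recognised in a question are mapped to the data sources whose data dictionaries contain matching elements, and the sources are ranked by coverage.

```python
# A minimal sketch: route a natural-language freight question to candidate
# data sources via a keyword-to-source lexicon (all names hypothetical).
KEYWORD_TO_SOURCES = {
    "tonnage":   ["faf_commodity_flows", "waterborne_commerce"],
    "truck":     ["faf_commodity_flows", "vehicle_inventory"],
    "rail":      ["carload_waybill_sample"],
    "commodity": ["faf_commodity_flows", "carload_waybill_sample"],
}

def candidate_sources(question):
    """Return data sources ranked by how many recognised keywords they cover."""
    words = question.lower().replace("?", "").split()
    scores = {}
    for w in words:
        for src in KEYWORD_TO_SOURCES.get(w, []):
            scores[src] = scores.get(src, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(candidate_sources("What commodity tonnage moved by truck in 2012?"))
```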
8

Type-safe Computation with Heterogeneous Data

Huang, Freeman Yufei 14 September 2007 (has links)
Computation with large-scale heterogeneous data typically requires universal traversal to search for all occurrences of a substructure that matches a possibly complex search pattern, whose context may differ from place to place within the data. Both aspects cause difficulty for existing general-purpose programming languages, because these languages are designed for homogeneous data and have problems typing the different substructures in heterogeneous data, and the complex patterns to match against those substructures. Programmers either have to hard-code the structures and search patterns, preventing programs from being reusable and scalable, or have to resort to low-level untyped programming or special-purpose query languages, opening the door to type mismatches that create a high risk of correctness and security problems. This thesis introduces the concept of pattern structures and proposes a general solution to the above problems: a programming technique using pattern structures. In this solution, well-typed pattern structures are defined to represent complex search patterns, and pattern searching over heterogeneous data is programmed with pattern parameters, in a statically-typed language that supports first-class typing of structures and patterns. The resulting programs are statically typed, highly reusable for different data structures and different patterns, and highly scalable in terms of the complexity of data structures and patterns. Adding new kinds of patterns for an application no longer requires changing the language in use or creating new ones, but is only a programming task. The thesis demonstrates the application of this approach to, and its advantages in, two important examples of computation with heterogeneous data: XML data processing and Java bytecode analysis. / Thesis (Ph.D, Computing) -- Queen's University, 2007-08-27 09:43:38.888
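The thesis works in a statically-typed language with first-class typing of structures and patterns; the following is only a loose, dynamically-typed Python approximation of the idea of a reusable pattern structure that is searched for by universal traversal over heterogeneous nested data.

```python
# A loose sketch: a reusable pattern object and a universal traversal that
# collects every matching substructure in arbitrarily nested data.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Pattern:
    """A pattern: a predicate over nodes, reusable across data shapes."""
    matches: Callable[[Any], bool]

def search(node: Any, pattern: Pattern, hits=None):
    """Universal traversal: collect every substructure the pattern matches."""
    if hits is None:
        hits = []
    if pattern.matches(node):
        hits.append(node)
    if isinstance(node, dict):
        for v in node.values():
            search(v, pattern, hits)
    elif isinstance(node, (list, tuple)):
        for v in node:
            search(v, pattern, hits)
    return hits

# Hypothetical nested record loosely resembling parsed bytecode metadata.
data = {"class": "Foo",
        "methods": [{"name": "bar", "opcodes": ["iload", "iadd"]}]}
uses_iadd = Pattern(lambda n: isinstance(n, dict)
                    and "iadd" in n.get("opcodes", []))
print(search(data, uses_iadd))
```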
9

Contextual Outlier Detection from Heterogeneous Data Sources

Yan, Yizhou 17 May 2020 (has links)
The dissertation focuses on detecting contextual outliers from heterogeneous data sources. Modern sensor-based applications such as Internet of Things (IoT) applications and autonomous vehicles generate a huge amount of heterogeneous data, including not only structured multi-variate data points but also other complex types of data such as time-stamped sequence data and image data. Detecting outliers from such data sources is critical to diagnose and fix malfunctioning systems, prevent cyber attacks, and save human lives. The outlier detection techniques in the literature are typically unsupervised algorithms with a pre-defined logic, for example leveraging the probability density at each point to detect outliers. Our analysis of modern applications reveals that this rigid probability-density-based methodology has severe drawbacks: low-probability-density objects are not necessarily outliers, while objects with relatively high probability densities might in fact be abnormal. In many cases, determining the outlierness of an object has to take into consideration the context in which the object occurs. Within this scope, the dissertation focuses on four research innovations: techniques and a system for scalable contextual outlier detection from multi-dimensional data points, contextual outlier pattern detection from sequence data, contextual outlier image detection from image data sets, and an integrative end-to-end outlier detection system capable of automatic outlier detection, outlier summarization and outlier explanation.
1. Scalable Contextual Outlier Detection from Multi-dimensional Data. Mining contextual outliers from big datasets is computationally expensive because of the complex recursive kNN search used to define the context of each point. In this research, leveraging the power of distributed compute clusters, we design distributed contextual outlier detection strategies that optimize the key factors determining the efficiency of local outlier detection, namely localizing the kNN search while still ensuring load balancing.
2. Contextual Outlier Detection from Sequence Data. For big sequence data, such as messages exchanged between devices and servers and log files measuring complex system behaviors over time, outliers typically occur as a subsequence of symbolic values (a sequential pattern) in which each individual value may itself be completely normal. However, existing sequential pattern mining semantics tend to mis-classify outlier patterns as typical patterns because they ignore the context in which a pattern occurs. In this dissertation, we present new context-aware pattern mining semantics and design efficient mining strategies to support these new semantics. In addition, methodologies that continuously extract these outlier patterns from sequence streams are also developed.
3. Contextual Outlier Detection from Image Data. An image classification system not only needs to accurately classify objects from target classes, but should also safely reject unknown objects that belong to classes not present in the training data. Here, the training data defines the context of the classifier, and unknown objects then correspond to contextual image outliers. Although existing Convolutional Neural Networks (CNNs) achieve high accuracy when classifying known objects, the sum operation on multiple features produced by the convolutional layers causes an unknown object to be classified to a target class with high confidence even if it matches some key features of that class only by chance. In this research, we design an Unknown-aware Deep Neural Network (UDN for short) to detect contextual image outliers. The key idea of UDN is to enhance the CNN to support a product operation that models the product relationship among the features produced by the convolutional layers. This way, missing a single key feature of a target class greatly reduces the probability of assigning an object to that class. To further improve the performance of UDN at detecting contextual outliers, we propose an information-theoretic regularization strategy that incorporates the objective of rejecting unknowns into the learning process of UDN.
4. An End-to-end Integrated Outlier Detection System. Although numerous detection algorithms have been proposed in the literature, no single approach brings the wealth of these alternatives to bear in an integrated infrastructure that supports versatile outlier discovery. In this work, we design the first end-to-end outlier detection service that integrates outlier-related services, including automatic outlier detection, outlier summarization and explanation, and human-guided outlier-detector refinement, within one integrated outlier discovery paradigm.
Experimental studies, including performance evaluations and user studies conducted on benchmark outlier detection datasets and real-world datasets (Geolocation, Lighting, MNIST, CIFAR, and log file datasets), confirm both the effectiveness and efficiency of the proposed approaches and systems.
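A minimal, single-machine sketch of the kNN-based local-density idea that the distributed strategies above scale up (not the dissertation's system), using scikit-learn's LocalOutlierFactor on synthetic data with two points that are only anomalous relative to their local context.

```python
# A minimal sketch of kNN-based local (contextual) outlier detection.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# Two dense clusters plus two points between them: their raw values are
# unremarkable in range, but they sit far from any dense neighbourhood.
cluster_a = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
cluster_b = rng.normal(loc=5.0, scale=0.3, size=(200, 2))
contextual = np.array([[1.5, 1.5], [3.5, 3.5]])
X = np.vstack([cluster_a, cluster_b, contextual])

# The kNN search around each point defines its context (local density).
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)           # -1 = outlier, 1 = inlier
print("flagged points:", X[labels == -1])
```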
10

An Application of Cluster Analysis in Identifying and Evaluating Prognostic Subgroups for Therapy-Related Acute Myeloid Leukemia

Antonilli, Stefanie January 2022 (has links)
Treatment for lymphoma with alkylating therapy is known to increase the risk of secondary malignancies such as Acute Myeloid Leukemia (AML), although the risk is not fully understood. This study investigates the characteristics of AML that arises after lymphoma treatment, in contrast to AML cases without a prior lymphoma. The study population consists of 115 individuals identified from the Swedish lymphoma register (SLR) with a diagnosis in the quality register for AML between 2000 and 2019, matched 1:1 to lymphoma-free comparators. A hierarchical cluster analysis with Gower's similarity measure and the k-prototypes clustering algorithm are employed to separately identify subgroups of those with a lymphoma history and the matched comparators. The survival of lymphoma patients is compared between subgroups in a Cox regression model. The findings suggest a two-cluster partition achieved by the hierarchical method for patients with a lymphoma history as well as for lymphoma-free patients (average silhouette 0.853 and 0.842, respectively). Both partitions completely separate patients with genetic information from those without. For AML patients with a preceding lymphoma, a subgroup defined by the hierarchical two-cluster partition is associated with an increased mortality rate (HR 2.40). A three-cluster partition achieved by the k-prototypes algorithm could be more clinically relevant; however, only one subgroup is associated with increased mortality (HR 2.73).
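A minimal sketch of the clustering workflow described above, on synthetic data rather than the register data (the variables shown are hypothetical): a Gower-style distance over mixed-type variables, hierarchical clustering with average linkage, and an average-silhouette check of the two-cluster partition.

```python
# A minimal sketch: Gower-style distance for mixed data, hierarchical
# clustering, and average silhouette for the resulting partition.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

# Hypothetical mixed-type patient data (numeric + categorical).
df = pd.DataFrame({
    "age": [68, 72, 45, 50, 70, 47],
    "karyotype_known": ["yes", "yes", "no", "no", "yes", "no"],
    "sex": ["F", "M", "F", "M", "F", "M"],
})

def gower_distance(df):
    """Average per-variable distances: range-scaled for numeric, 0/1 for categorical."""
    n = len(df)
    d = np.zeros((n, n))
    for col in df.columns:
        x = df[col].to_numpy()
        if np.issubdtype(x.dtype, np.number):
            span = x.max() - x.min()
            d += np.abs(x[:, None] - x[None, :]) / (span if span else 1.0)
        else:
            d += (x[:, None] != x[None, :]).astype(float)
    return d / df.shape[1]

D = gower_distance(df)
tree = linkage(squareform(D, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print("clusters:", labels)
print("average silhouette:", silhouette_score(D, labels, metric="precomputed"))
```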
