  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
131

An APIfication Approach to Facilitate the Access and Reuse of Open Data

González Mora, César 10 September 2021
Nowadays, there is a tendency to publish data on the Web, due to the benefits it brings to society and to new legislation that encourages the opening of data. These collections of open data, also known as datasets, are typically published in open data portals by governments and institutions around the world in order to make them open, that is, available on the Web in a free and reusable manner. Publishers commonly expose their data as individual tabular datasets. Open data is considered highly valuable because promoting the use of public information produces transparency, innovation and other social, political and economic benefits. This importance is especially considerable in situational scenarios, where a small group of consumers (developers or data scientists) with specific needs require thematic data for a short life cycle. There are different mechanisms that let these data consumers easily assess whether the data is adequate for their purpose. For example, SPARQL endpoints have become very useful for the consumption of open data and, particularly, Linked Open Data (LOD). Moreover, in order to access open data in a straightforward manner, Web Application Programming Interfaces (APIs) are also a highly recommended feature of open data portals. However, accessing this open data is a difficult task since current open data platforms do not generally provide suitable strategies to access their data. On the one hand, accessing open data through SPARQL endpoints is difficult because it requires knowledge of different technologies, which is challenging especially for novice developers. Moreover, LOD is not usually available since the most used formats in government open data portals are tabular. On the other hand, although Web APIs would make it easy for developers to access and reuse open data from open data portals' catalogs, there is a lack of suitable Web APIs in open data portals.
Moreover, in most cases, the currently available APIs only allow access to the catalog's metadata or the download of entire data resources (i.e. coarse-grained access to data), hampering the reuse of data. In addition, as open data is commonly published individually without considering potential relationships with other datasets, reusing several open datasets together is not a trivial task; it requires mechanisms that allow data consumers to integrate and access tabular open data published on the Web. Therefore, open data is not being used to its full potential because it is not easily accessible. As access to open data is thus still limited for end-users, particularly those without programming skills, we propose a model-based approach to automatically generate Web APIs from open data. This APIfication approach takes into account the access to multiple integrated tabular datasets and the consumption of data in situational scenarios. Firstly, we focus on data that can be integrated by means of join and union operations. Then, we coin the term disposable Web APIs as an alternative mechanism for the consumption of open data in situational scenarios. These disposable Web APIs are created on the fly to be used temporarily by a user to consume specific open data. Accordingly, the main objective is to provide suitable mechanisms to easily access and reuse open data on the fly and in an integrated manner, solving the problem of difficult access through SPARQL endpoints for most data consumers and the lack of suitable Web APIs with easy access to open data. With this approach, we address both open data publishers and consumers: publishers will be able to include a Web API with their data, and data consumers or reusers will benefit in those cases where a Web API pointing to the open data is missing.
The results of the experiments conducted led us to conclude that users consider our generated Web APIs easy to use and find that they provide the desired open data, even when it comes from different datasets, especially in situational scenarios. / This thesis was funded by the University of Alicante through a predoctoral training contract, and by the Generalitat Valenciana through a grant for hiring predoctoral research staff (ACIF2019).
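The join and union operations mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's model-based generator: the function names (`read_table`, `union`, `join`, `query`), the column names, and the sample rows are all hypothetical, and the `query` function only stands in for the kind of fine-grained access a generated Web API might offer.

```python
import csv
import io


def read_table(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def union(a, b):
    """Stack the rows of two tables that share the same columns."""
    return a + b


def join(left, right, key):
    """Inner join on a shared key column, using a hash index on `right`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def query(rows, **filters):
    """Fine-grained access: return only the rows matching every filter."""
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]


# Hypothetical tabular datasets, as they might be published separately.
stations_2020 = read_table("station,city\nS1,Alicante\n")
stations_2021 = read_table("station,city\nS2,Elche\n")
readings = read_table("station,no2\nS1,31\nS2,18\nS2,22\n")

stations = union(stations_2020, stations_2021)   # same schema, stacked
integrated = join(readings, stations, "station")  # enriched with city
elche = query(integrated, city="Elche")
```

Here the consumer asks for the rows of one city across three separately published tables, which is the kind of request that coarse-grained, whole-file download does not serve directly.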
132

Kombinace laserových a snímkových dat z mobilního mapovacího systému / Combination of laser and image data from a mobile mapping system

Stránská, Petra January 2021
The diploma thesis describes the integration of data from different 3D technologies, specifically close-range photogrammetry, aerial photogrammetry using RPAS, and terrestrial laser scanning. The thesis deals mainly with photogrammetric processing in the ContextCapture software and data integration in this software. The thesis also describes the construction of a calibration field. The points of the field were used as ground control points and check points during processing. The accuracy of the outputs was evaluated by statistical testing of the coordinate deviations of the control points. The result of the thesis is a 3D model of one of the buildings located in the AdMaS research center.
133

Dimension reduction methods for nonlinear association analysis with applications to omics data

Wu, Peitao 06 November 2021
With advances in high-throughput techniques, the availability of large-scale omics data has revolutionized the fields of medicine and biology, and has offered a better understanding of the underlying biological mechanisms. However, the high-dimensionality and the unknown association structure between different data types make statistical integration analyses challenging. In this dissertation, we develop three dimensionality reduction methods to detect nonlinear association structure using omics data. First, we propose a method for variable selection in a nonparametric additive quantile regression framework. We enforce a network regularization to incorporate information encoded by known networks. To account for nonlinear associations, we approximate the additive functional effect of each predictor with the expansion of a B-spline basis. We implement the group Lasso penalty to achieve sparsity. We define the network-constrained penalty by regulating the difference between the effect functions of any two linked genes (predictors) in the network. Simulation studies show that our proposed method performs well in identifying truly associated genes with fewer falsely associated genes than alternative approaches. Second, we develop a canonical correlation analysis (CCA)-based method, canonical distance correlation analysis (CDCA), and leverage the distance correlation to capture the overall association between two sets of variables. The CDCA allows untangling linear and nonlinear dependence structures. Third, we develop the sparse CDCA (sCDCA) method to achieve sparsity and improve result interpretability by adding penalties on the loadings from the CDCA. The sCDCA method can be applied to data with large dimensionality and small sample size. We develop iterative majorization-minimization-based coordinate descent algorithms to compute the loadings in the CDCA and sCDCA methods. 
Simulation studies show that the proposed CDCA and sCDCA approaches have better performance than classical CCA and sparse CCA (sCCA) in nonlinear settings and have similar performance in linear association settings. We apply the proposed methods to the Framingham Heart Study (FHS) to identify body mass index associated genes, the association structure between metabolic disorders and metabolite profiles, and a subset of metabolites and their associated type 2 diabetes (T2D)-related genes.
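To make concrete why distance correlation can capture nonlinear association, the sketch below computes the standard sample distance correlation for two univariate samples and compares it with the Pearson correlation. It is only an illustration of the dependence measure named in the abstract; it does not implement CDCA or sCDCA, which additionally estimate loadings, and the function names and toy data are assumptions.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distances, double-centered by row, column and grand mean."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]          # row means equal column means (symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long univariate samples."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for r in a for v in r) / n ** 2
    dvar_y = sum(v * v for r in b for v in r) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


# Toy data: y depends on x only through a symmetric, nonlinear function.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
linear = pearson(x, y)                  # vanishes for this symmetric design
nonlinear = distance_correlation(x, y)  # clearly positive
```

On this toy design the Pearson coefficient is zero while the distance correlation is well above zero, which mirrors the linear versus nonlinear distinction the abstract draws between CCA and CDCA.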
134

The Basics of Complex Correspondences and Functions and their Implementation and Semi-automatic Detection in COMA++

Arnold, Patrick 26 February 2018
This master's thesis explains how a classical schema matcher is extended to express complex correspondences (many-to-many correspondences) and general functions between two schemas, and how their automatic detection extends the conventional detection of (1:1) correspondences. The latter point addresses an area of data integration that has scarcely been studied so far, and approaches are presented that can enrich many schema matchers. To this end, the first part of the thesis introduces complex correspondences and functions in the field of schema mapping in detail.
135

Integrace Linked Data / Linked Data Integration

Michelfeit, Jan January 2013
Linked Data have emerged as a successful publication format which could mean to structured data what the Web meant to documents. The strength of Linked Data is in its fitness for integration of data from multiple sources. Linked Data integration opens the door to new opportunities but also poses new challenges. New algorithms and tools need to be developed to cover all steps of data integration. This thesis examines the established data integration processes and how they can be applied to Linked Data, with a focus on data fusion and conflict resolution. Novel algorithms for Linked Data fusion are proposed, and the task of supporting trust with provenance information and quality assessment of fused data is addressed. The proposed algorithms are implemented as part of a Linked Data integration framework, ODCleanStore.
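One common way to resolve conflicts during data fusion is to weight each candidate value by the quality of the sources asserting it and to keep the provenance of the winner. The sketch below shows that idea under stated assumptions: it is not ODCleanStore's algorithm, and the `fuse` function, the quad layout, the source names, and the quality scores are all hypothetical.

```python
from collections import defaultdict


def fuse(quads, source_quality):
    """Pick one value per (subject, predicate) by summed source quality.

    `quads` is an iterable of (subject, predicate, object, source) tuples.
    Returns, per (subject, predicate), the winning value, a confidence in
    [0, 1], and the sorted list of sources that asserted the winner.
    """
    support = defaultdict(lambda: defaultdict(float))
    sources = defaultdict(lambda: defaultdict(list))
    for s, p, o, src in quads:
        support[(s, p)][o] += source_quality.get(src, 0.0)
        sources[(s, p)][o].append(src)

    fused = {}
    for key, candidates in support.items():
        total = sum(candidates.values())
        best = max(candidates, key=candidates.get)
        fused[key] = {
            "value": best,
            "confidence": candidates[best] / total if total else 0.0,
            "provenance": sorted(sources[key][best]),
        }
    return fused


# Hypothetical conflicting statements about one resource.
quads = [
    ("ex:Prague", "ex:population", "1300000", "sourceA"),
    ("ex:Prague", "ex:population", "1200000", "sourceB"),
    ("ex:Prague", "ex:population", "1200000", "sourceC"),
    ("ex:Prague", "ex:country", "ex:Czechia", "sourceB"),
]
quality = {"sourceA": 0.9, "sourceB": 0.4, "sourceC": 0.3}
result = fuse(quads, quality)
```

In this toy case a single high-quality source outweighs two weaker sources that agree with each other; a different aggregation (for example, majority voting) would decide the other way, which is why the choice of conflict resolution policy matters.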
136

Scalable Data Integration for Linked Data

Nentwig, Markus 06 August 2020
Linked Data describes an extensive set of structured but heterogeneous data sources where entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most of the existing link discovery approaches disregard the fact that in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed to link equivalent entities of multiple data sources to construct integrated knowledge bases. In this context, the focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable reuse of existing links. Unlike previous systems, we execute the complete data integration workflow via a distributed processing system. At first, the LinkLion portal will be introduced to provide existing links for new applications. These links act as a basis for a physical data integration process to create a unified representation for equivalent entities from many data sources.
We then propose a holistic clustering approach to form consolidated clusters for same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster.
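The idea of reusing existing links to cluster entities from many sources, adding entities incrementally, and using the semantic type to reject suspicious links can be sketched on a single machine with a union-find structure. This is a simplified stand-in, not the dissertation's distributed Apache Flink implementation; the `EntityClusters` class, the entity identifiers, and the type names are hypothetical.

```python
class EntityClusters:
    """Incremental clustering of entities connected by same-as links."""

    def __init__(self):
        self.parent = {}  # union-find parent pointers
        self.types = {}   # semantic type per entity

    def add_entity(self, entity, entity_type):
        """Register a new entity as its own singleton cluster."""
        self.parent.setdefault(entity, entity)
        self.types.setdefault(entity, entity_type)

    def find(self, entity):
        """Return the cluster representative, compressing the path."""
        root = entity
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[entity] != root:
            self.parent[entity], entity = root, self.parent[entity]
        return root

    def add_link(self, a, b):
        """Merge the clusters of a and b unless their types disagree."""
        ra, rb = self.find(a), self.find(b)
        if self.types[ra] != self.types[rb]:
            return False  # treated as a likely erroneous link
        self.parent[rb] = ra
        return True

    def clusters(self):
        """Return all clusters as sorted lists, in sorted order."""
        groups = {}
        for entity in self.parent:
            groups.setdefault(self.find(entity), set()).add(entity)
        return sorted(sorted(g) for g in groups.values())


# Hypothetical entities from three sources plus one mistyped link.
ec = EntityClusters()
ec.add_entity("dbpedia:Leipzig", "Settlement")
ec.add_entity("geonames:Leipzig", "Settlement")
ec.add_entity("dbpedia:Leipzig_University", "Organisation")
ok_1 = ec.add_link("dbpedia:Leipzig", "geonames:Leipzig")
rejected = ec.add_link("geonames:Leipzig", "dbpedia:Leipzig_University")

# A new source arrives later; existing clusters are extended, not rebuilt.
ec.add_entity("nyt:Leipzig", "Settlement")
ok_2 = ec.add_link("nyt:Leipzig", "geonames:Leipzig")
```

Because links are consumed one at a time, adding a new source only touches the clusters its links reach, which is the property that makes the incremental setting cheaper than re-clustering from scratch.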
137

On Pattern Mining in Graph Data to Support Decision-Making

Petermann, André 14 February 2019
In recent years graph data models became increasingly important in both research and industry. Their core is a generic data structure of things (vertices) and connections among those things (edges). Rich graph models such as the property graph model promise an extraordinary analytical power because relationships can be evaluated without knowledge about a domain-specific database schema. This dissertation studies the usage of graph models for data integration and data mining of business data. Although a typical company's business data implicitly describes a graph it is usually stored in multiple relational databases. Therefore, we propose the first semi-automated approach to transform data from multiple relational databases into a single graph whose vertices represent domain objects and whose edges represent their mutual relationships. This transformation is the base of our conceptual framework BIIIG (Business Intelligence with Integrated Instance Graphs). We further proposed a graph-based approach to data integration. The process is executed after the transformation. In established data mining approaches interrelated input data is mostly represented by tuples of measure values and dimension values. In the context of graphs these values must be attached to the graph structure and aggregated measure values are graph attributes. Since the latter was not supported by any existing model, we proposed the use of collections of property graphs. They act as data structure of the novel Extended Property Graph Model (EPGM). The model supports vertices and edges that may appear in different graphs as well as graph properties. Further on, we proposed some operators that benefit from this data structure, for example, graph-based aggregation of measure values. A primitive operation of graph pattern mining is frequent subgraph mining (FSM). However, existing algorithms provided no support for directed multigraphs. 
We extended the popular gSpan algorithm to overcome this limitation. Some patterns might not be frequent while their generalizations are. Generalized graph patterns can be mined by attaching vertices to taxonomies. We proposed a novel approach to Generalized Multidimensional Frequent Subgraph Mining (GM-FSM), in particular the first solution to generalized FSM that supports not only directed multigraphs but also multiple dimensional taxonomies. In scenarios that compare patterns of different categories, e.g., fraud or not, FSM is not sufficient since pattern frequencies may differ by category. Further on, determining all pattern frequencies without frequency pruning is not an option due to the computational complexity of FSM. Thus, we developed an FSM extension to extract patterns that are characteristic for a specific category according to a user-defined interestingness function called Characteristic Subgraph Mining (CSM). Parts of this work were done in the context of GRADOOP, a framework for distributed graph analytics. To make the primitive operation of frequent subgraph mining available to this framework, we developed Distributed In-Memory gSpan (DIMSpan), a frequent subgraph miner that is tailored to the characteristics of shared-nothing clusters and distributed dataflow systems. Finally, the results of use case evaluations in cooperation with a large scale enterprise will be presented. This includes a report of practical experiences gained in implementation and application of the proposed algorithms.
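Frequent subgraph mining starts from the support of the smallest patterns. The sketch below counts the support of single-edge patterns in a collection of directed multigraphs, counting each pattern once per graph even when parallel edges repeat it. It covers only this first step and is not gSpan, DIMSpan, or the GM-FSM and CSM methods of the dissertation; the graph encoding, labels, and the `frequent_edge_patterns` function are assumptions made for illustration.

```python
from collections import Counter


def frequent_edge_patterns(graphs, min_support):
    """Return single-edge patterns contained in at least min_support of the graphs.

    Each graph is a pair (vertices, edges): `vertices` maps a vertex id to
    its label, and `edges` is a list of (source_id, edge_label, target_id).
    A pattern is the triple (source_label, edge_label, target_label).
    """
    counts = Counter()
    for vertices, edges in graphs:
        # A set, so that parallel edges count only once per graph.
        patterns = {(vertices[s], label, vertices[t]) for s, label, t in edges}
        counts.update(patterns)
    threshold = min_support * len(graphs)
    return {p: c for p, c in counts.items() if c >= threshold}


# Hypothetical business graphs; the first one contains a parallel edge.
graphs = [
    ({1: "Order", 2: "Invoice", 3: "Customer"},
     [(2, "bills", 1), (2, "bills", 1), (1, "placedBy", 3)]),
    ({1: "Order", 2: "Invoice"},
     [(2, "bills", 1)]),
    ({1: "Order", 2: "Ticket"},
     [(2, "refersTo", 1)]),
]
frequent = frequent_edge_patterns(graphs, min_support=0.6)
```

A full miner would then grow the surviving patterns edge by edge and prune any extension whose support falls below the threshold, which is where the computational cost the abstract mentions arises.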
138

Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology

Krämer, Stefan 30 May 2022
Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems given in data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level considering sources of complex environmental data. The first study employed data of an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses in their suitability to disentangle mixture effects of chemical exposures to biological effects and their reliability in attributing potentially adverse outcomes to chemical drivers with toxicological databases on gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects — stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of lowly concentrated compounds. The applied approaches allowed assessing the hazard of chemicals more systematically with correlation-based compound groups. This dissertation presents another achievement toward a data-driven hypothesis generation for molecular exposure effects. The approach combined text-mining and deep learning. The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. 
We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base. However, we could not reliably confirm known chemical-gene interactions across selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources, like literature, databases or omics-based exposure studies. Thus, the deep learning models might allow predicting hypotheses of exposure-related molecular effects. Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.

Table of Contents:
Abstract
Acknowledgements
Prelude
1 Introduction
  1.1 An overview of environmental toxicology
    1.1.1 Environmental toxicology
    1.1.2 Chemicals in the environment
    1.1.3 Systems biological perspectives in environmental toxicology
  1.2 Computational toxicology
    1.2.1 Omics-based approaches
    1.2.2 Linking chemical exposure to transcriptional effects
    1.2.3 Up-scaling from the gene level to higher biological organisation levels
    1.2.4 Biomedical literature-based discovery
    1.2.5 Deep learning with knowledge representation
  1.3 Research question and approaches
2 Methods and Data
  2.1 Linking environmental relevant mixture exposures to transcriptional effects
    2.1.1 Exposure and microarray data
    2.1.2 Preprocessing
    2.1.3 Differential gene expression
    2.1.4 Association rule mining
    2.1.5 Weighted gene correlation network analysis
    2.1.6 Method comparison
  2.2 Predicting exposure-related effects on a molecular level
    2.2.1 Input
    2.2.2 Input preparation
    2.2.3 Deep learning models
    2.2.4 Toxicogenomic application
3 Method comparison to link complex stream water exposures to effects on the transcriptional level
  3.1 Background and motivation
    3.1.1 Workflow
  3.2 Results
    3.2.1 Data preprocessing
    3.2.2 Differential gene expression analysis
    3.2.3 Association rule mining
    3.2.4 Network inference
    3.2.5 Method comparison
    3.2.6 Application case of method integration
  3.3 Discussion
  3.4 Conclusion
4 Deep learning prediction of chemical-biomolecule interactions
  4.1 Motivation
    4.1.1 Workflow
  4.2 Results
    4.2.1 Input preparation
    4.2.2 Model selection
    4.2.3 Model comparison
    4.2.4 Toxicogenomic application
    4.2.5 Horizontal augmentation without tail-padding
    4.2.6 Four-class problem formulation
    4.2.7 Training with CTD data
  4.3 Discussion
    4.3.1 Transferring biomedical knowledge towards toxicology
    4.3.2 Deep learning with biomedical knowledge representation
    4.3.3 Data integration
  4.4 Conclusion
5 Conclusion and Future perspectives
  5.1 Conclusion
    5.1.1 Investigating complex mixtures in the environment
    5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions
    5.1.3 Linking chemical exposure to biological effects by integrating CTD
  5.2 Future perspectives
S1 Supplement Chapter 1
  S1.1 Example of an estrogen bioassay
  S1.2 Types of mode of action
  S1.3 The dogma of molecular biology
  S1.4 Transcriptomics
S2 Supplement Chapter 3
S3 Supplement Chapter 4
  S3.1 Hyperparameter tuning results
  S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets
  S3.3 Reduction of learning rate in a model with large word embedding vectors
  S3.4 Horizontal augmentation without tail-padding
  S3.5 Four-relationship classification
  S3.6 Interpreting loss observations for SemMedDB trained models
List of Abbreviations
List of Figures
List of Tables
Bibliography
Curriculum scientiae
Selbständigkeitserklärung
139

Supporting Scientific Collaboration through Workflows and Provenance

Ellqvist, Tommy January 2010
Science is changing. Computers, fast communication, and new technologies have created new ways of conducting research. For instance, researchers from different disciplines are processing and analyzing scientific data that is increasing at an exponential rate. This kind of research requires that the scientists have access to tools that can handle huge amounts of data, enable access to vast computational resources, and support the collaboration of large teams of scientists. This thesis focuses on tools that help support scientific collaboration. Workflows and provenance are two concepts that have proven useful in supporting scientific collaboration. Workflows provide a formal specification of scientific experiments, and provenance offers a model for documenting data and process dependencies. Together, they enable the creation of tools that can support collaboration through the whole scientific life-cycle, from specification of experiments to validation of results. However, existing models for workflows and provenance are often specific to particular tasks and tools. This makes it hard to analyze the history of data that has been generated over several application areas by different tools. Moreover, workflow design is a time-consuming process and often requires extensive knowledge of the tools involved and collaboration with researchers with different expertise. This thesis addresses these problems. Our first contribution is a study of the differences between two approaches to interoperability between provenance models: direct data conversion, and mediation. We perform a case study where we integrate three different provenance models using the mediation approach, and show the advantages compared to data conversion. Our second contribution serves to support workflow design by allowing multiple users to concurrently design workflows. Current workflow tools lack the ability for users to work simultaneously on the same workflow.
We propose a method that uses the provenance of workflow evolution to enable real-time collaborative design of workflows. Our third contribution considers supporting workflow design by reusing existing workflows. Workflow collections for reuse are available, but more efficient methods for generating summaries of search results are still needed. We explore new summarization strategies that consider the workflow structure.
140

Network Models for Capturing Molecular Feature and Predicting Drug Target for Various Cancers

Liu, Enze 12 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Network-based modeling and analysis have been widely used for capturing molecular trajectories of cellular processes. For complex diseases like cancers, if we can utilize network models to capture adequate features, we can gain better insight into the mechanisms of cancers, which will further facilitate the identification of molecular vulnerabilities and the development of targeted therapy. Based on this rationale, we conducted the following four studies: A novel algorithm ‘FFBN’ is developed for reconstructing directional regulatory networks (DEGs) from tissue expression data to identify molecular features. ‘FFBN’ shows a unique capability to reconstruct genome-wide DEGs quickly and accurately compared to existing methods. FFBN is further used to capture molecular features among liver metastasis, primary liver cancers and primary colon cancers. Comparisons among these features lead to new understandings of how liver metastasis is similar to its primary and distant cancers. ‘SCN’ is a novel algorithm that incorporates multiple types of omics data to reconstruct functional networks, not only for revealing molecular vulnerabilities but also for predicting drug targets on top of that. The molecular vulnerabilities are discovered via tissue-specific networks and drug targets are predicted via cell-line-specific networks. SCN is tested on primary pancreatic cancers and the predictions coincide with current treatment plans. ‘SCN website’ is a web application of the ‘SCN’ algorithm. It allows users to easily submit their own data and get predictions online. The predictions are displayed along with network graphs and survival curves. ‘DSCN’ is a novel algorithm derived from ‘SCN’. Instead of predicting single targets like ‘SCN’, ‘DSCN’ applies a novel approach for predicting target combinations using multiple omics data and network models.
In conclusion, our studies revealed how genes regulate each other in the form of networks and how these networks can be used for unveiling cancer-related biological processes. Our algorithms and website facilitate capturing molecular features for cancers and predicting novel drug targets.
