251 |
Linked Data Quality Assessment and its Application to Societal Progress Measurement / Zaveri, Amrapali, 19 May 2015 (has links) (PDF)
In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration, where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows information to be published and exchanged in an interoperable and reusable fashion. Many different communities on the Internet, such as geographic, media, life sciences and government, have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented.
With the emergence of the Web of Linked Data, several use cases become possible thanks to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously.
In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data gravely affects the end results, making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. Datasets that contain quality problems may still be useful for certain applications, depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. Insufficient data quality can be caused either by the LD publication process or can be intrinsic to the data source itself.
A key challenge is to assess the quality of datasets published on the Web and to make this quality information explicit. Assessing data quality is particularly challenging in LD because the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes assessing quality crucial for measuring how accurately the real world is represented. On the document Web, data quality can only be indirectly or vaguely defined, but there is a need for more concrete and measurable data quality metrics for LD. Such metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness and consistency with regard to implicit information. Even though data quality is an important concept in LD, few methodologies have been proposed to assess the quality of these datasets.
Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for the assessment of LD. The first methodology employs LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems through the automatic creation of an extended schema for DBpedia, and the second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e. workers on online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (the previous assessment by LD experts and the assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology.
Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is provided not only with the results of the assessment but also with the specific entities that cause the errors, which helps users understand the quality issues and fix them. Finally, we consider a domain-specific use case that consumes LD and relies on data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
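As a rough illustration of the kind of metric-based check such an assessment relies on (not code from the thesis or from R2RLint), the sketch below computes one simple completeness-style indicator over an RDF graph with rdflib; the metric choice and the sample file name are assumptions made for the example.

```python
from rdflib import Graph
from rdflib.namespace import RDF

def untyped_subject_ratio(graph: Graph) -> float:
    """Fraction of distinct subjects without any rdf:type statement.

    A simple completeness-style indicator; lower values are better.
    """
    subjects = set(graph.subjects())
    if not subjects:
        return 0.0
    untyped = [s for s in subjects if (s, RDF.type, None) not in graph]
    return len(untyped) / len(subjects)

g = Graph()
g.parse("dataset_sample.ttl", format="turtle")  # hypothetical local sample of a LD dataset
print(f"Untyped subject ratio: {untyped_subject_ratio(g):.2%}")
```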
|
252 |
A Study of Characters in Chinese and Japanese, including Semantic Shift / Fan, Jiageng, January 2014 (has links)
This thesis examines characters in Chinese and Japanese, including semantic shift. The writing systems of China, Japan and a number of other nations whose script relates to characters, notably Korea, are also discussed. By examining this "Character Cultural Sphere" in East Asia along with historical and modern character standardizations and reforms, the role of Chinese characters proves to be essential. Furthermore, the thesis investigates semantic shifts of characters as windows on socio-cultural change in two given areas, namely "disorder" to "order" and "natural" to "artificial, manmade". One major aim is to explore shifts of meanings (semantic shifts) that can provide a commentary on changes in societal and cultural values. The results reveal that the pattern of semantic shifts in China and Japan is remarkably similar. Regarding "natural vs manmade", the overall trend shows that in both China and Japan more characters acquired the meaning of "artificial, manmade" as time went by, reflecting changes in society. Regarding "disorder vs order", while the percentage of characters relating to "disorder" remained relatively stable in both countries, the percentage of characters relating to "order" saw an undeniable increase - more than doubling in both Chinese and Japanese - showing that in both countries the overall societal trend was clearly towards more "order" even as "disorder" continued to exist. These results give quantitative data on the pattern of evolution of Chinese and Japanese societies, particularly Chinese, and provide an insight, through written scripts, into the evolution of human beings and civilizations.
Also, because of its length, the main database of the research, the table of 2,500 common-use characters with commentary, is attached after the bibliography as an appendix.
|
253 |
SemDQ: A Semantic Framework for Data Quality Assessment / Zhu, Lingkai, January 2014 (has links)
Objective:
Access to, and reliance upon, high quality data is an enabling cornerstone of modern health delivery systems. Sadly, health systems are often awash with poor quality data, which contributes to adverse outcomes and can compromise the search for new knowledge. Traditional approaches to purging poor data from health information systems often require manual, laborious and time-consuming procedures at the collection, sanitizing and processing stages of the information life cycle, with results that often remain sub-optimal. A promising solution may lie with semantic technologies - a family of computational standards and algorithms capable of expressing and deriving the meaning of data elements. Semantic approaches purport to offer the ability to represent clinical knowledge in ways that can support complex searching and reasoning tasks. It is argued that this ability offers exciting promise as a novel approach to assessing and improving data quality. This study examines the effectiveness of Semantic Web technologies as a mechanism by which high quality data can be collected and assessed in health settings. To make this assessment, key study objectives include determining the ability to construct a valid semantic data model that sufficiently expresses the complexity present in the data, as well as the development of a comprehensive set of validation rules that can be applied semantically to test the effectiveness of the proposed semantic framework.
Methods:
The Semantic Framework for Data Quality Assessment (SemDQ) was designed. A core component of the framework is an ontology representing data elements and their relationships in a given domain. In this study, the ontology was developed using openEHR standards, with extensions to capture data elements used for patient care and research purposes in a large organ transplant program. Data quality dimensions were defined and corresponding criteria for assessing data quality were developed for each dimension. These criteria were then applied using semantic technology to an anonymized research dataset containing medical data on transplant patients. Results were validated by clinical researchers. Another test was performed on a simulated dataset with the same attributes as the research dataset to confirm the computational accuracy and effectiveness of the framework.
Results:
A prototype of SemDQ was successfully implemented, consisting of an ontological model integrating the openEHR reference model, a vocabulary of transplant variables and a set of data quality dimensions. Thirteen criteria in three data quality dimensions were transformed into computational constructs using Semantic Web standards. Reasoning and logic inconsistency checking were first performed on the simulated dataset, which contains carefully constructed test cases to ensure the correctness and completeness of the logical computation. The same quality checking algorithms were then applied to an established research database. Data quality defects were successfully identified in this dataset, even though it had been manually cleansed and validated periodically. Among the 103,505 data entries, application of two criteria did not return any error, while eleven of the criteria detected erroneous or missing data, with error rates ranging from 0.05% to 79.9%. Multiple review sessions were held with clinical researchers to verify the results. The SemDQ framework was refined to reflect the intricate clinical knowledge. Data corrections were implemented in the source dataset as well as in the clinical system used in the transplant program, resulting in improved quality of data for both clinical and research purposes.
Implications:
This study demonstrates the feasibility and benefits of using semantic technologies in data quality assessment processes. SemDQ is based on Semantic Web standards, which allows easy reuse of rules and leverages generic reasoning engines for computation. This mechanism avoids the shortcomings of proprietary rule engines, which often make rulesets and knowledge developed for one dataset difficult to reuse with different datasets, even in a similar clinical domain. SemDQ can implement rules that have been shown to have a greater capacity to detect complex cross-reference logic inconsistencies. In addition, the framework allows easy extension of the knowledge base to incorporate more data types and validation criteria. It has the potential to be incorporated into current workflows in clinical care settings to reduce data errors during the process of data capture.
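To make the idea of a semantically applied validation criterion concrete, here is a minimal sketch of a cross-reference consistency check expressed as a SPARQL query with rdflib. The vocabulary (ex:birthDate, ex:transplantDate), the file name and the rule itself are hypothetical illustrations, not the actual SemDQ criteria or its openEHR-based model.

```python
from rdflib import Graph

g = Graph()
g.parse("transplant_sample.ttl", format="turtle")  # hypothetical excerpt of a patient dataset

# Hypothetical vocabulary: ex:birthDate and ex:transplantDate are xsd:date properties.
# The rule flags records where the transplant date precedes the birth date.
violations = g.query("""
    PREFIX ex: <http://example.org/transplant#>
    SELECT ?patient ?birth ?tx
    WHERE {
        ?patient ex:birthDate ?birth ;
                 ex:transplantDate ?tx .
        FILTER (?tx < ?birth)
    }
""")
for patient, birth, tx in violations:
    print(f"{patient}: transplant {tx} precedes birth {birth}")
```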
|
254 |
An ontology for enhancing automation and interoperability in Enterprise Crowdsourcing Environments / Hetmank, Lars, 17 November 2014 (links) (PDF)
Enterprise crowdsourcing transforms the way in which traditional business tasks can be processed by harnessing the collective intelligence and workforce of a large and often diversified group of people. At the present time, data and information residing within enterprise crowdsourcing systems and other business applications are insufficiently interlinked and are rarely made publicly available in an open and semantically structured manner – neither to the corporate intranet nor to the World Wide Web (WWW). However, the semantic annotation of enterprise crowdsourcing activities is a promising research and application domain. The Semantic Web and its related technologies, methods and principles for publishing structured data offer an extension of the traditional layout-oriented Web to provide more intelligent and complex services.
This technical report describes the efforts toward a universal and lightweight yet powerful Semantic Web vocabulary for the domain of enterprise crowdsourcing. As a methodology for developing the vocabulary, the approach of ontology engineering is applied. To illustrate the purpose and to limit the scope of the ontology, several informal competency questions as well as functional and non-functional requirements are presented. The subsequent conceptualization of the ontology applies different sources of knowledge and considers various perspectives. A set of semantic entities is derived from a review of existing crowdsourcing applications and a review of recent crowdsourcing literature. During the domain capture, all partial results of the review are integrated into a consistent data dictionary and structured as a UML data schema. The designed ontology includes 24 classes, 22 object properties and 30 datatype properties to describe the key aspects of a crowdsourcing model (CSM). To demonstrate the technical feasibility, the ontology is implemented using the Web Ontology Language (OWL). Finally, the ontology is evaluated by means of transforming informal to formal competency questions, comparing it to existing semantic vocabularies, and calculating ontology metrics. Evidence is shown that the CSM ontology covers the key representational needs of the enterprise crowdsourcing domain. At the end of the technical report, current limitations are illustrated and directions for future research are proposed.
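For readers unfamiliar with how such a vocabulary is expressed in OWL, the fragment below sketches how a couple of classes and properties could be declared with rdflib. The namespace, class and property names are invented for illustration and are not the actual terms of the CSM ontology.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

CSM = Namespace("http://example.org/csm#")  # placeholder namespace, not the published one
g = Graph()
g.bind("csm", CSM)

# Illustrative fragment: a crowdsourcing Task class, a Worker class, an object
# property linking them, and a reward datatype property (names are assumptions).
g.add((CSM.Task, RDF.type, OWL.Class))
g.add((CSM.Worker, RDF.type, OWL.Class))
g.add((CSM.assignedTo, RDF.type, OWL.ObjectProperty))
g.add((CSM.assignedTo, RDFS.domain, CSM.Task))
g.add((CSM.assignedTo, RDFS.range, CSM.Worker))
g.add((CSM.reward, RDF.type, OWL.DatatypeProperty))
g.add((CSM.reward, RDFS.domain, CSM.Task))

print(g.serialize(format="turtle"))
```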
|
255 |
Ubiquitous Semantic Applications / Ermilov, Timofey, 14 January 2015 (links) (PDF)
As Semantic Web technology evolves, many open areas emerge which attract more research focus. In addition to the quickly expanding Linked Open Data (LOD) cloud, various embeddable metadata formats (e.g. RDFa, microdata) are becoming more common. Corporations are already using the existing Web of Data to create new technologies that were not possible before. Watson by IBM, an artificial intelligence computer system capable of answering questions posed in natural language, is a great example.
On the other hand, ubiquitous devices equipped with a large number of sensors and integrated components are becoming increasingly powerful and fully featured computing platforms in our pockets and homes. For many people, smartphones and tablet computers have already replaced traditional computers as their window to the Internet and to the Web. Hence, the management and presentation of information that is useful to the user is a main requirement for today's smartphones, and it is becoming extremely important to provide access to the emerging Web of Data from these ubiquitous devices.
In this thesis we investigate how ubiquitous devices can interact with the Semantic Web. We identified five different approaches for bringing the Semantic Web to ubiquitous devices. We have outlined and discussed in detail the existing challenges in implementing these approaches in section 1.2. We have described a conceptual framework for ubiquitous semantic applications in chapter 4. We distinguish three client approaches for accessing semantic data using ubiquitous devices, depending on how much of the semantic data processing is performed on the device itself (thin, hybrid and fat clients). These are discussed in chapter 5 along with solutions to every related challenge. Two provider approaches (fat and hybrid) can be distinguished for exposing data from ubiquitous devices on the Semantic Web. These are discussed in chapter 6 along with solutions to every related challenge. We conclude our work with a discussion of each of the contributions of the thesis and propose future work for each of the discussed approaches in chapter 7.
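As a minimal sketch of the thin-client approach, in which all semantic data processing stays on a remote endpoint and the device only issues queries and renders results, the following Python example queries the public DBpedia SPARQL endpoint; the endpoint choice and query are examples, not taken from the thesis.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Thin-client style: the device sends a SPARQL query to a remote endpoint and
# only renders the returned JSON bindings; no RDF processing happens locally.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:country dbr:Germany ;
              dbo:populationTotal ?population .
    }
    ORDER BY DESC(?population)
    LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```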
|
256 |
Using known schemas and mappings to construct new semantic mappings / Madhavan, Jayant. January 2005 (links)
Thesis (Ph. D.)--University of Washington, 2005. / Vita. Includes bibliographical references (p. 145-158).
|
257 |
OntoFeed: um leitor de Feeds com extensão ontológica / OntoFeed: a feed reader with ontological extension / Marcelo Gomes Rodrigues, 23 August 2011 (links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / The problem motivating this study is the lack of semantics in Web search mechanisms. To address it, the W3C has been developing technologies aimed at building a Semantic Web, among them domain ontologies. In this context, the general goal of this dissertation is to discuss the possibilities of adding semantics to searches in Web news aggregators. The specific goal is to present an application that performs a semi-automatic classification of news by combining search technologies from the information retrieval field with domain ontologies. The proposed system is a Web application capable of retrieving news about a specific domain from information portals. It uses the Google Maps API V1 to geo-reference the location of each news item whenever this information is available. To show the feasibility of the proposal, an example was developed based on an ontology for the domain of rainfall and its consequences. The results obtained by this new ontology-based feed are stored in a database and made available for querying via the Web. The expectation is that the proposed feed yields more relevant results than a common feed. The results obtained by combining technologies sponsored by the W3C (XML, RSS and ontologies) with Web page search tools were satisfactory for the intended purpose. Ontologies prove to be multi-purpose tools, and their analytical value in Web searches can be extended with computational applications appropriate to each case. As in the example presented in this dissertation, other concepts were aggregated to the word rainfall, concepts that were present in the consequences it brings about. This highlighted the connection between the rainfall event and the consequences it provokes, something that was only possible through a cutout of the formal knowledge involved.
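To illustrate the general idea of ontology-assisted feed classification (a sketch under assumed names, not the OntoFeed implementation), the snippet below parses an RSS feed with feedparser and tags items that mention labels from a small, hypothetical rainfall-domain vocabulary; the feed URL and concept identifiers are placeholders.

```python
import feedparser

# Illustrative vocabulary fragment: labels from a hypothetical rainfall-domain
# ontology, flattened to a label -> concept lookup for simple keyword matching.
ONTOLOGY_TERMS = {
    "rainfall": "ex:Rainfall",
    "flood": "ex:Flood",
    "landslide": "ex:Landslide",
}

def classify(feed_url: str):
    """Yield (title, concepts) for each news item mentioning an ontology label."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        text = f"{entry.get('title', '')} {entry.get('summary', '')}".lower()
        concepts = [c for label, c in ONTOLOGY_TERMS.items() if label in text]
        if concepts:
            yield entry.get("title", ""), concepts

for title, concepts in classify("https://example.org/news/rss"):  # placeholder feed URL
    print(title, "->", concepts)
```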
|
259 |
Context-Aware Adaptive Hybrid Semantic Relatedness in Biomedical Science / January 2016 (links)
abstract: Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts can be very useful in solutions to different problems such as relationship extraction, ontology creation and question answering [1–6]. Several techniques exist for calculating the semantic relatedness of two concepts, utilizing different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. This work attempts to eliminate the need for manually combining semantic relatedness methods for new contexts or resources by proposing an automated method that finds the combination of semantic relatedness techniques and resources achieving the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context given the available algorithms and resources. / Dissertation/Thesis / Doctoral Dissertation Biomedical Informatics 2016
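Since the abstract does not specify the combination mechanism, the following is only an illustrative sketch of one way to combine several relatedness measures and pick per-context weights against gold-standard ratings; the measure names, the weighted-average combiner and the error criterion are all assumptions, not the thesis's actual method.

```python
from itertools import product

def combine(scores: dict, weights: dict) -> float:
    """Weighted average of individual relatedness scores (each in 0..1)."""
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total if total else 0.0

def grid_search(pairs, gold, measures, steps=5):
    """Pick the weight vector whose combined scores have the lowest mean absolute
    error against gold-standard ratings for the current context."""
    grid = [i / (steps - 1) for i in range(steps)]
    best, best_err = None, float("inf")
    for ws in product(grid, repeat=len(measures)):
        if not any(ws):
            continue  # skip the all-zero weight vector
        weights = dict(zip(measures, ws))
        err = sum(abs(combine(s, weights) - g) for s, g in zip(pairs, gold)) / len(gold)
        if err < best_err:
            best, best_err = weights, err
    return best, best_err

# Toy example: per-pair scores from two hypothetical measures vs. gold ratings.
pairs = [{"path": 0.8, "vector": 0.6}, {"path": 0.2, "vector": 0.4}]
gold = [0.7, 0.3]
print(grid_search(pairs, gold, ["path", "vector"]))
```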
|
260 |
Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations / January 2016 (links)
abstract: High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.
Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as "handcrafted" features, as they were deliberately designed based on reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex-scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple averaging, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information, since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner that considers both semantic and spatio-temporal information.
In this thesis, I propose a novel architecture to learn semantic spatio-temporal embeddings for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders, which capture the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset. / Dissertation/Thesis / Masters Thesis Computer Science 2016
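A minimal PyTorch sketch of the described two-stream layout (per-frame CNN features, assumed precomputed, fed to GRU encoders whose final states are fused and mapped by an MLP into a word2vec-sized space) is given below; layer sizes, feature dimensions and the fusion step are illustrative guesses, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class Video2VecSketch(nn.Module):
    """Illustrative two-stream GRU encoder + MLP mapping to a word2vec-sized space.

    Assumes per-frame appearance and motion CNN features are precomputed;
    all dimensions are made up for the example.
    """
    def __init__(self, feat_dim=2048, hidden=512, embed_dim=300):
        super().__init__()
        self.appearance_gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.motion_gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, appearance_seq, motion_seq):
        # Each input: (batch, time, feat_dim); keep only each GRU's final hidden state.
        _, h_app = self.appearance_gru(appearance_seq)
        _, h_mot = self.motion_gru(motion_seq)
        fused = torch.cat([h_app[-1], h_mot[-1]], dim=1)
        return self.mlp(fused)  # (batch, embed_dim), to be compared with word2vec label vectors

model = Video2VecSketch()
app = torch.randn(4, 16, 2048)   # 4 clips, 16 frames, 2048-d appearance features
mot = torch.randn(4, 16, 2048)
print(model(app, mot).shape)     # torch.Size([4, 300])
```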
|