141

Integrace Linked Data / Linked Data Integration

Michelfeit, Jan January 2013 (has links)
Linked Data has emerged as a successful publication format that could mean to structured data what the Web meant to documents. The strength of Linked Data lies in its fitness for integrating data from multiple sources. Linked Data integration opens the door to new opportunities but also poses new challenges, and new algorithms and tools need to be developed to cover all steps of data integration. This thesis examines established data integration processes and how they can be applied to Linked Data, with a focus on data fusion and conflict resolution. Novel algorithms for Linked Data fusion are proposed, and the tasks of supporting trust with provenance information and of assessing the quality of fused data are addressed. The proposed algorithms are implemented as part of the ODCleanStore Linked Data integration framework.
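As a rough illustration of the conflict-resolution step described above, the sketch below fuses conflicting property values for the same subject and predicate under a chosen resolution policy. The resource names, quality scores, and policy names are illustrative assumptions; this is a minimal sketch, not the ODCleanStore implementation itself.

```python
from collections import defaultdict

# Minimal conflict-resolution sketch for fusing RDF-style statements.
# Each statement is (subject, predicate, object) plus an assumed source
# quality score; names and scores are illustrative, not from ODCleanStore.

def fuse(statements, policy="best_source"):
    """Fuse conflicting object values for each (subject, predicate) pair."""
    grouped = defaultdict(list)
    for subj, pred, obj, quality in statements:
        grouped[(subj, pred)].append((obj, quality))

    fused = {}
    for key, values in grouped.items():
        if policy == "best_source":
            # Keep the value asserted by the highest-quality source.
            fused[key] = max(values, key=lambda v: v[1])[0]
        elif policy == "vote":
            # Keep the most frequently asserted value.
            candidates = [obj for obj, _ in values]
            fused[key] = max(set(candidates), key=candidates.count)
        else:  # "all": keep every distinct value
            fused[key] = sorted({obj for obj, _ in values})
    return fused

statements = [
    ("dbr:Prague", "dbo:populationTotal", "1309000", 0.9),
    ("dbr:Prague", "dbo:populationTotal", "1280000", 0.6),
    ("dbr:Prague", "dbo:country",         "dbr:Czech_Republic", 0.8),
]
print(fuse(statements, policy="best_source"))
```

A quality-weighted policy such as the "best_source" one above is one simple way to let provenance-derived trust scores influence the fused result.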
142

Multiple Entity Reconciliation

Samoila, Lavinia Andreea January 2015 (has links)
Living in the age of "Big Data" is both a blessing and a curse. On the one hand, the raw data can be analysed and then used for weather predictions, user recommendations, targeted advertising and more. On the other hand, when data is aggregated from multiple sources, there is no guarantee that each source has stored the data in a format that is standardized or even compatible with what the application requires. So there is a need to parse the available data and convert it to the desired form. This is where the problems start to arise: the correspondences are often not so straightforward between data instances that belong to the same domain but come from different sources. For example, in the film industry, information about movies (cast, characters, ratings etc.) can be found on numerous websites such as IMDb or Rotten Tomatoes. Finding and matching all the data referring to the same movie is a challenge. The aim of this project is to select the most efficient algorithm to correlate movie-related information gathered from various websites automatically. We have implemented a flexible application that allows us to compare the performance of multiple algorithms based on machine learning techniques. According to our experimental results, a well-chosen set of rules is on par with a neural network; these two prove to be the most effective classifiers for records with movie information as content.
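To make the rule-based approach concrete, here is a minimal matching sketch under assumed record fields (title and year) and an assumed similarity threshold; the thesis's actual rule set and features are not described in this record, so treat this purely as an illustration.

```python
from difflib import SequenceMatcher

# Minimal rule-based matcher for movie records from two sources.
# Field names and thresholds are illustrative assumptions.

def normalise(title):
    """Lowercase a title and strip punctuation so formatting differences vanish."""
    return "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace()).strip()

def same_movie(rec_a, rec_b, title_threshold=0.9):
    """Decide whether two records describe the same movie."""
    title_sim = SequenceMatcher(None, normalise(rec_a["title"]),
                                normalise(rec_b["title"])).ratio()
    same_year = abs(rec_a["year"] - rec_b["year"]) <= 1  # tolerate off-by-one release years
    return title_sim >= title_threshold and same_year

imdb_rec = {"title": "The Godfather Part II", "year": 1974}
rt_rec   = {"title": "The Godfather: Part II", "year": 1974}
print(same_movie(imdb_rec, rt_rec))  # True: normalisation removes the punctuation difference
```

In practice such rules would be complemented by further attributes (cast overlap, runtime, ratings) before a pair of records is accepted as a match.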
143

Efficient Extraction and Query Benchmarking of Wikipedia Data

Morsey, Mohamed 12 April 2013 (has links)
Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost-intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large, collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on data that is several months old. Hence, a strategy to keep DBpedia in synchronization with Wikipedia is needed. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles and processes it on-the-fly to obtain RDF data, updating the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added and deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changeset publication. Knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their data. Furthermore, triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission-critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational databases and triplestores and thus settled on measuring performance against a relational database that had been converted to RDF, using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering, and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful for comparing existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data.
This task includes several subtasks, and in this thesis we address two of the major ones: fact validation and provenance, and data quality. Fact validation and provenance aim at providing sources for facts in order to ensure the correctness and traceability of the provided knowledge. This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents, and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming, as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), an algorithm for validating facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information, including a score for the confidence DeFacto has in the correctness of the input fact. The data quality subtask, on the other hand, aims at evaluating and continuously improving the quality of the data in knowledge bases. We present a methodology for assessing the quality of knowledge bases' data, which comprises a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises the evaluation of a large number of individual resources against the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool in which a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia.
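For readers unfamiliar with how such RDF data is queried, the sketch below issues a simple SPARQL query against the public DBpedia endpoint using the SPARQLWrapper library. The query is only an example of the kind of human-issued query a SPARQL benchmark might build on, not one taken from the thesis's query logs, and the endpoint URL and ontology properties are assumed to be the commonly used public ones.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query the public DBpedia SPARQL endpoint for the five most populous
# German cities recorded in the knowledge base.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:country <http://dbpedia.org/resource/Germany> ;
              dbo:populationTotal ?population .
    }
    ORDER BY DESC(?population)
    LIMIT 5
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["population"]["value"])
```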
144

Marketing Research in the 21st Century: Opportunities and Challenges

Hair, Joe F., Harrison, Dana E., Risher, Jeffrey J. 01 October 2018 (has links)
The role of marketing is evolving rapidly, and the design and analysis methods used by marketing researchers are also changing. These changes are emerging from transformations in management skills, technological innovations, and continuously evolving customer behavior. But perhaps the most substantial driver of these changes is the emergence of big data and the analytical methods used to examine and understand the data. To remain relevant, marketing research must stay as dynamic as the markets themselves and adapt to the following: data will continue increasing exponentially; data quality will improve; analytics will be more powerful, easier to use, and more widely used; management and customer decisions will increasingly be knowledge-based; privacy issues and challenges will be both a problem and an opportunity as organizations develop their analytics skills; data analytics will become firmly established as a competitive advantage, both in the marketing research industry and in academia; and for the foreseeable future, the demand for highly trained data scientists will exceed the supply.
145

Data Quality in the Interface of Industrial Manufacturing and Machine Learning / Data kvalité i gränssittet mellan industriel tillverkning och machine learning

Timocin, Teoman January 2020 (has links)
Innovations are converging and changing business landscapes, markets, and societies. Data-driven technologies create new expectations, or raise existing ones, for products, services, and business processes. Industrial companies must reconstruct both their physical environment and their mindset to adapt successfully. One of the technologies paving the way for data-driven acceleration is machine learning. Machine learning technologies require a high degree of structured digitalization and data to be functional. The technology has the potential to extract immense value for manufacturers because of its ability to analyse large quantities of data. The author of this thesis identified a research gap regarding how industrial manufacturers need to approach and prepare for machine learning technologies. Research indicated that data quality is one of the significant issues when organisations try to adopt the technology. Earlier frameworks on data quality have not yet captured the aspects of manufacturing and machine learning together. By reviewing data quality frameworks that include machine learning or manufacturing perspectives, the thesis aims to contribute an area-specific data quality framework at the interface of machine learning and manufacturing. To gain further insights and to complement current research in these areas, qualitative interviews were conducted with experts on machine learning, data, and industrial manufacturing. The study finds that ten data quality dimensions are essential for industrial manufacturers interested in machine learning. The insights from the framework contribute knowledge to data quality research and provide industrial manufacturing companies with an understanding of the data requirements of machine learning.
146

SILE: A Method for the Efficient Management of Smart Genomic Information

León Palacio, Ana 25 November 2019 (has links)
In the last two decades, the data generated by next-generation sequencing technologies have revolutionized our understanding of human biology. Furthermore, they have allowed us to develop and improve our knowledge about how changes (variants) in the DNA can be related to the risk of developing certain diseases. Currently, a large amount of genomic data is publicly available and frequently used by the research community in order to extract meaningful and reliable associations among risk genes and the mechanisms of disease. However, the management of this exponential growth of data has become a challenge, and researchers are forced to delve into a lake of complex data spread over more than a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality. Nevertheless, when these data are used to solve a concrete problem, only a small part of them is really significant. This is what we call "smart" data. The main goal of this thesis is to provide a systematic approach to efficiently manage smart genomic data, by using conceptual modeling techniques and the principles of data quality assessment. The aim of this approach is to populate an Information System with data that are accessible, informative and actionable enough to extract valuable knowledge. / This thesis was supported by the Research and Development Aid Program (PAID-01-16) under FPI grant 2137. / León Palacio, A. (2019). SILE: A Method for the Efficient Management of Smart Genomic Information [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/131698 / Awarded an extraordinary doctoral thesis prize.
147

Data Suitability Assessment and Enhancement for Machine Prognostics and Health Management Using Maximum Mean Discrepancy

Jia, Xiaodong January 2018 (has links)
No description available.
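Although this record carries no abstract, the title refers to Maximum Mean Discrepancy (MMD), a kernel-based measure of the distance between two distributions. The sketch below shows the standard biased MMD² estimator with an RBF kernel, as one might use it to compare, say, features from a healthy machine against features from a degraded one; the feature layout, kernel bandwidth, and example data are assumptions, not the thesis's actual assessment procedure.

```python
import numpy as np

# Biased estimator of squared Maximum Mean Discrepancy (MMD^2) with an
# RBF kernel. Illustrative only; the thesis's actual data suitability
# procedure is not described in this record.

def rbf_kernel(a, b, sigma=1.0):
    """Pairwise RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2 * a @ b.T)
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased MMD^2 estimate between samples x and y (rows = observations)."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
healthy  = rng.normal(0.0, 1.0, size=(200, 3))   # e.g. features from a healthy machine
degraded = rng.normal(0.5, 1.2, size=(200, 3))   # shifted, wider distribution
print(mmd2(healthy, healthy[100:]))   # small: samples from the same distribution
print(mmd2(healthy, degraded))        # larger: the distributions differ
```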
148

Implementing simplified LCA software in heavy-duty vehicle design : An evaluation study of LCA data quality for supporting sustainable design decisions / Implementering av förenklad LCA-programvara i design av tunga fordon : En utvärderingsstudie av LCA-datakvalitet för att stödja hållbara designbeslut

Teng, Chih-Chin January 2020 (has links)
Simplified life cycle assessment (LCA) methods quickly deliver an estimate of a product's life-cycle impacts without intense data requirements, and are therefore taken as a practical tool in the early stage of product development (PD) to support sustainable decisions. However, the obstacle lies in integrating the LCA tool efficiently and effectively into designers' daily workflows. To give a comprehensive overview of the potential challenges in integrating simplified LCA software into vehicle PD processes, the research conducts an accessibility, intrinsic, contextual, and representational data quality evaluation of two vehicle-LCA software tools, Granta Selector and the Modular-LCA Kit, by means of interviews, case studies, and usability testing. From the four data quality evaluations, the results demonstrate (1) the importance of the company's collaboration with the software developers to ensure the software's accessibility; (2) the data accuracy constraints of the software due to the generic database and over-simplified methods; (3) the vehicle design engineers' reactions to the two tools' data fulfilment when building complicated vehicle LCA models; and (4) the effectiveness of the LCA results in supporting sustainable design decisions. Overall, the reliability of the two simplified LCA tools is sufficient only in the very beginning stage of PD, while user satisfaction and the effectiveness of the simplified LCA data are positive for design engineers with a basic level of sustainability knowledge. Still, there is a need for systematic strategies for integrating the software into PD processes. A three-pillar strategy that covers company administrative policy, software management and promotion, and LCA and vehicle data life-cycle management could tackle the data gaps and the limitations of the software and the company. Based on this strategy, the research proposes an example roadmap for Scania.
149

Data quality and governance in a UK social housing initiative: Implications for smart sustainable cities

Duvier, Caroline, Anand, Prathivadi B., Oltean-Dumbrava, Crina 03 March 2018 (has links)
Smart Sustainable Cities (SSC) consist of multiple stakeholders, who must cooperate in order for SSCs to be successful. Housing is an important challenge in many cities, and social housing organisations are therefore key stakeholders. This paper introduces a qualitative case study of a social housing provider in the UK which implemented a business intelligence project (a method to assess data networks within an organisation) to increase data quality and data interoperability. Our analysis suggests that creating pathways for different information systems within an organisation to ‘talk to’ each other is the first step. Some of the issues during the project implementation included the lack of training and development, organisational reluctance to change, and the lack of a project plan. The challenges faced by the organisation during this project can be helpful for those implementing SSCs. Currently, many SSC frameworks and models exist, yet most seem to neglect the localised challenges faced by the different stakeholders. This paper hopes to help bridge this gap in the SSC research agenda.
150

Anomaly Detection in Time Series Data Based on Holt-Winters Method / Anomalidetektering i tidsseriedata baserat på Holt-Winters metod

Aboode, Adam January 2018 (has links)
In today's world the amount of collected data increases every day, a trend which is likely to continue. At the same time, the potential value of the data also increases due to the constant development and improvement of hardware and software. However, in order to gain insights, make decisions or train accurate machine learning models, we want to ensure that the data we collect is of good quality. There are many definitions of data quality; in this thesis we focus on the accuracy aspect. One method which can be used to ensure accurate data is to monitor for and alert on anomalies. In this thesis we therefore suggest a method which, based on historic values, is able to detect anomalies in time series as new values arrive. The method consists of two parts: forecasting the next value in the time series using the Holt-Winters method, and comparing the residual to an estimated Gaussian distribution. The suggested method is evaluated in two steps. First, we evaluate the forecast accuracy of the Holt-Winters method for different input sizes. In the second step we evaluate the performance of the anomaly detector when different methods are used to estimate the variance of the distribution of the residuals. The results indicate that the suggested method works well most of the time for the detection of point anomalies in seasonal and trending time series data. The thesis also discusses some potential next steps which are likely to further improve the performance of this method.
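A minimal sketch of the two-part method described above is shown below, assuming the statsmodels implementation of Holt-Winters (additive trend and seasonality), a synthetic seasonal series, and a 3-sigma residual threshold; these choices are illustrative, not the thesis's tuned configuration.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Two-step detector: forecast the next point with Holt-Winters, then flag
# it if the residual falls far outside a Gaussian fitted to past residuals.
# The seasonal period and the 3-sigma threshold are illustrative assumptions.

rng = np.random.default_rng(1)
t = np.arange(400)
history = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)

fit = ExponentialSmoothing(history, trend="add",
                           seasonal="add", seasonal_periods=24).fit()

# Gaussian model of the in-sample forecast residuals.
residuals = history - fit.fittedvalues
mu, sigma = residuals.mean(), residuals.std()

def is_anomaly(new_value, n_sigma=3.0):
    """Flag new_value if its forecast residual lies outside mu +/- n_sigma * sigma."""
    forecast = fit.forecast(1)[0]          # one-step-ahead prediction
    return abs((new_value - forecast) - mu) > n_sigma * sigma

print(is_anomaly(history[-1] + 0.2))   # close to the forecast -> likely not flagged
print(is_anomaly(history[-1] + 10.0))  # large jump -> likely flagged
```

Fitting the Gaussian to in-sample residuals is only one possible way to estimate the variance; as the abstract notes, several variance-estimation methods were compared in the evaluation.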
