141

Integrade Linked Data / Linked Data Integration

Michelfeit, Jan January 2013 (has links)
Linked Data have emerged as a successful publication format which could mean to structured data what the Web meant to documents. The strength of Linked Data lies in its fitness for integrating data from multiple sources. Linked Data integration opens the door to new opportunities but also poses new challenges. New algorithms and tools need to be developed to cover all steps of data integration. This thesis examines established data integration processes and how they can be applied to Linked Data, with a focus on data fusion and conflict resolution. Novel algorithms for Linked Data fusion are proposed, and the task of supporting trust with provenance information and quality assessment of fused data is addressed. The proposed algorithms are implemented as part of the Linked Data integration framework ODCleanStore.
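The abstract does not detail the fusion algorithms themselves; as a minimal illustrative sketch of what conflict resolution during Linked Data fusion can look like (the source names, trust scores, and resolution policy below are assumptions, not ODCleanStore's actual method):

```python
from collections import defaultdict

# Hypothetical source trust scores (assumed for illustration).
TRUST = {"sourceA": 0.9, "sourceB": 0.6, "sourceC": 0.4}

def fuse_triples(quads):
    """Fuse (subject, predicate, object, source) quads: for each
    (subject, predicate) pair, keep the object asserted by the most
    trusted source and record which sources support it (provenance)."""
    grouped = defaultdict(list)
    for s, p, o, src in quads:
        grouped[(s, p)].append((o, src))

    fused = []
    for (s, p), candidates in grouped.items():
        # Conflict resolution policy: prefer the value from the most trusted source.
        best_obj, _ = max(candidates, key=lambda c: TRUST.get(c[1], 0.0))
        supporting = [src for o, src in candidates if o == best_obj]
        fused.append((s, p, best_obj, supporting))
    return fused

quads = [
    ("dbr:Prague", "dbo:population", "1300000", "sourceA"),
    ("dbr:Prague", "dbo:population", "1250000", "sourceB"),
    ("dbr:Prague", "dbo:country", "dbr:Czech_Republic", "sourceC"),
]
print(fuse_triples(quads))
```

Other policies (most frequent value, newest value, keep all with quality scores) fit the same structure; the thesis's contribution lies in such policies plus provenance and quality scoring of the fused output.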
142

Multiple Entity Reconciliation

Samoila, Lavinia Andreea January 2015 (has links)
Living in the age of "Big Data" is both a blessing and a curse. On the one hand, the raw data can be analysed and then used for weather predictions, user recommendations, targeted advertising and more. On the other hand, when data is aggregated from multiple sources, there is no guarantee that each source has stored the data in a format that is standardized or even compatible with what is required by the application. So there is a need to parse the available data and convert it to the desired form. Here is where the problems start to arise: often the correspondences are not quite so straightforward between data instances that belong to the same domain but come from different sources. For example, in the film industry, information about movies (cast, characters, ratings etc.) can be found on numerous websites such as IMDb or Rotten Tomatoes. Finding and matching all the data referring to the same movie is a challenge. The aim of this project is to select the most efficient algorithm to correlate movie-related information gathered from various websites automatically. We have implemented a flexible application that allows us to compare the performance of multiple algorithms based on machine learning techniques. According to our experimental results, a well-chosen set of rules is on par with the results from a neural network, these two proving to be the most effective classifiers for records with movie information as content.
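As a hedged illustration of the rule-based side of such a comparison (not the thesis's implementation; the fields, thresholds, and similarity measure are assumptions), a tiny matcher that pairs movie records by title similarity and release year:

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Normalized character-level similarity between two movie titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def same_movie(rec_a, rec_b, title_threshold=0.85, year_tolerance=1):
    """Rule-based match: titles must be very similar and release years
    must agree within a small tolerance (sources often differ by a year)."""
    if abs(rec_a["year"] - rec_b["year"]) > year_tolerance:
        return False
    return title_similarity(rec_a["title"], rec_b["title"]) >= title_threshold

imdb_record = {"title": "The Dark Knight", "year": 2008}
rt_record = {"title": "Dark Knight, The", "year": 2008}
print(same_movie(imdb_record, rt_record))  # False: word reordering defeats naive character similarity
```

Failures like the reordering above are exactly why hand-tuned rule sets are benchmarked against learned classifiers in the thesis.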
143

Efficient Extraction and Query Benchmarking of Wikipedia Data

Morsey, Mohamed 12 April 2013 (has links)
Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost-intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on several-months-old data. Hence, a strategy for keeping DBpedia in synchronization with Wikipedia is highly desirable. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles and processes it on-the-fly to obtain RDF data, updating the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changeset publication. Knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission-critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational databases and triplestores and thus settled on measuring performance against a relational database which had been converted to RDF using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful for comparing existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data.
This task includes several subtasks, and in this thesis we address two of the major ones: fact validation and provenance, and data quality. Fact validation and provenance aim at providing sources for facts in order to ensure the correctness and traceability of the provided knowledge. This subtask is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents, and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming, as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), an algorithm for validating facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information, including a score for the confidence DeFacto has in the correctness of the input fact. The subtask of data quality maintenance, on the other hand, aims at evaluating and continuously improving the quality of the knowledge bases' data. We present a methodology for assessing the quality of knowledge bases' data which comprises a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises the evaluation of a large number of individual resources against the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool wherein a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia.
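The benchmark described above replays SPARQL queries against a triplestore. As a hedged illustration of the kind of query such a setup issues against the public DBpedia endpoint (the query itself is illustrative, not one of the benchmark's mined queries), a minimal sketch using the SPARQLWrapper library:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia SPARQL endpoint; the query is an illustrative example only.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:country dbr:Germany ;
              dbo:populationTotal ?population .
    }
    ORDER BY DESC(?population)
    LIMIT 5
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["city"]["value"], binding["population"]["value"])
```

The benchmark procedure in the thesis mines many such real queries from logs, clusters them by SPARQL features, and times representatives of each cluster against the stores under test.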
144

Marketing Research in the 21st Century: Opportunities and Challenges

Hair, Joe F., Harrison, Dana E., Risher, Jeffrey J. 01 October 2018 (has links)
The role of marketing is evolving rapidly, and design and analysis methods used by marketing researchers are also changing. These changes are emerging from transformations in management skills, technological innovations, and continuously evolving customer behavior. But perhaps the most substantial driver of these changes is the emergence of big data and the analytical methods used to examine and understand the data. To continue being relevant, marketing research must remain as dynamic as the markets themselves and adapt accordingly to the following: Data will continue increasing exponentially; data quality will improve; analytics will be more powerful, easier to use, and more widely used; management and customer decisions will increasingly be knowledge-based; privacy issues and challenges will be both a problem and an opportunity as organizations develop their analytics skills; data analytics will become firmly established as a competitive advantage, both in the marketing research industry and in academics; and for the foreseeable future, the demand for highly trained data scientists will exceed the supply.
145

Data Quality in the Interface of Industrial Manufacturing and Machine Learning / Data kvalité i gränssittet mellan industriel tillverkning och machine learning

Timocin, Teoman January 2020 (has links)
Innovations are coming together and changing business landscapes, markets, and societies. Data-driven technologies create new expectations, or raise existing ones, for products, services, and business processes. Industrial companies must reconstruct both their physical environment and their mindset to adapt successfully. One of the technologies paving the way for data-driven acceleration is machine learning. Machine learning technologies require a high degree of structured digitalization and data to be functional. The technology has the potential to extract immense value for manufacturers because of its ability to analyse large quantities of data. The author of this thesis identified a research gap regarding how industrial manufacturers need to approach and prepare for machine learning technologies. Research indicated that data quality is one of the significant issues when organisations try to adopt the technology. Earlier frameworks on data quality have not yet captured the aspects of manufacturing and machine learning together. By reviewing data quality frameworks that include machine learning or manufacturing perspectives, the thesis aims to contribute an area-specific data quality framework at the interface of machine learning and manufacturing. To gain further insights and to complement the current research in these areas, qualitative interviews were conducted with experts on machine learning, data and industrial manufacturing. The study finds that ten different data quality dimensions are essential for industrial manufacturers interested in machine learning. The insights from the framework contribute knowledge to data quality research, as well as providing industrial manufacturing companies with an understanding of machine learning data requirements.
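The ten dimensions are not listed in the abstract; purely as an illustrative sketch (the dimension choice, column names, and valid ranges are assumptions, not the thesis's framework), two commonly cited dimensions, completeness and validity, computed over a table of manufacturing sensor readings:

```python
import pandas as pd

# Hypothetical machine-sensor log; column names and physical ranges are assumed.
readings = pd.DataFrame({
    "spindle_temp_c": [61.2, 63.1, None, 250.0, 62.4],
    "vibration_mm_s": [1.1, 1.3, 1.2, None, 40.0],
})
VALID_RANGES = {"spindle_temp_c": (0, 120), "vibration_mm_s": (0, 25)}

def completeness(df):
    """Share of non-missing values per column (a classic data quality dimension)."""
    return 1.0 - df.isna().mean()

def validity(df, ranges):
    """Share of recorded values that fall inside their allowed physical range."""
    scores = {}
    for col, (lo, hi) in ranges.items():
        present = df[col].dropna()
        scores[col] = float(present.between(lo, hi).mean()) if len(present) else float("nan")
    return pd.Series(scores)

print(completeness(readings))
print(validity(readings, VALID_RANGES))
```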
146

Data Suitability Assessment and Enhancement for Machine Prognostics and Health Management Using Maximum Mean Discrepancy

Jia, Xiaodong January 2018 (has links)
No description available.
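The record carries no abstract, but the title names maximum mean discrepancy (MMD), a kernel two-sample statistic that measures how far apart two data distributions are. As a generic sketch of a (biased) Gaussian-kernel MMD estimator, offered for intuition only and not tied to the thesis's actual method:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y with an RBF
    kernel k(a, b) = exp(-gamma * ||a - b||^2):
        MMD^2 = mean(k(X, X)) + mean(k(Y, Y)) - 2 * mean(k(X, Y))
    A small value suggests the two samples come from similar distributions."""
    def rbf(A, B):
        # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    return rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean()

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(200, 3))    # e.g. features from healthy operation
similar = rng.normal(0.0, 1.0, size=(200, 3))
degraded = rng.normal(1.5, 1.0, size=(200, 3))   # shifted distribution

print(mmd_rbf(healthy, similar))   # close to zero
print(mmd_rbf(healthy, degraded))  # noticeably larger
```

In a prognostics-and-health-management setting, such a distance can indicate whether available data resembles the conditions a model was trained for.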
147

Implementing simplified LCA software in heavy-duty vehicle design : An evaluation study of LCA data quality for supporting sustainable design decisions / Implementering av förenklad LCA-programvara i design av tunga fordon : En utvärderingsstudie av LCA-datakvalitet för att stödja hållbara designbeslut

Teng, Chih-Chin January 2020 (has links)
Simplified life cycle assessment (LCA) methods quickly deliver an estimate of a product's life-cycle impacts without intense data requirements, which makes them a practical tool for supporting sustainable decisions in the early stages of product development (PD). However, integrating LCA tools efficiently and effectively into designers' daily workflows remains an obstacle. To give a comprehensive overview of the potential challenges in integrating simplified LCA software into vehicle PD processes, the research conducts an accessibility, intrinsic, contextual and representational data quality evaluation of two vehicle-LCA software packages, Granta Selector and the Modular-LCA Kit, by means of interviews, case studies and usability testing. From the four data quality evaluations, the results demonstrate (1) the importance of the company's collaboration with the software developers to ensure the software's accessibility; (2) the data accuracy constraints of the software due to the generic database and over-simplified methods; (3) the vehicle design engineers' reactions to the two packages' data sufficiency when building complex vehicle LCA models; and (4) the effectiveness of the LCA results in supporting sustainable design decisions. Overall, the reliability of the two simplified LCA software packages is sufficient only in the very early stage of PD, while user satisfaction and the effectiveness of the simplified LCA data are positive for design engineers with a basic level of sustainability knowledge. Still, systematic strategies are needed for integrating the software into PD processes. A three-pillar strategy covering company administrative policy, software management and promotion, and LCA and vehicle data life-cycle management could tackle the data gaps and limitations of the software and the company. Based on this strategy, the research proposes an example roadmap for Scania. / Genom en förenklad livscykelanalys (LCA), kan man tidigt i produktutvecklingen få en indikation över ett fordons miljöpåverkan. Analysen kan agera som ett verktyg för att ge stöd till mer hållbara beslut i produktutvecklingen. En svårighet ligger dock i att integrera LCA i designers dagliga arbetsflöde på ett effektivt sätt. För att skapa en översikt av Scanias utvecklare och designers LCA-datakrav för hållbar fordonsutveckling genomfördes en datakvalitetsutvärdering ("accessibility, intrinsic, contextual, and representational") av två LCA-programvaror, Granta Selector och Modular-LCA-kit. Från detta kunde en strategi och handlingsplan tas fram för implementering av LCA-programvara inom fordonsutveckling. Resultaten indikerar att programvarornas tillförlitlighet endast är tillräckliga i ett tidigt skede i produktutvecklingen. Dessutom varierar användarnas tillfredsställelse och effektiviteten av programvarans förenklade data utifrån designerns kunskapsnivå inom hållbarhet. För att ha en framgångsrik integrering av LCA-programvaran i fordonskonstruktionen, utvecklades en strategi med tre pelare. Dessa täcker Scanias företagspolicy och mjukvaruhantering samt hanteringen av livscykel inventariet och BOM-data, för att hantera brister i dataseten men även begränsningar hos programvaran och företaget. Baserat på denna strategi presenteras en möjlig handlingsplan för Scania.
148

Data quality and governance in a UK social housing initiative: Implications for smart sustainable cities

Duvier, Caroline, Anand, Prathivadi B., Oltean-Dumbrava, Crina 03 March 2018 (has links)
Smart Sustainable Cities (SSC) consist of multiple stakeholders, who must cooperate in order for SSCs to be successful. Housing is an important challenge in many cities; therefore, social housing organisations are key stakeholders. This paper introduces a qualitative case study of a social housing provider in the UK who implemented a business intelligence project (a method to assess data networks within an organisation) to increase data quality and data interoperability. Our analysis suggests that creating pathways for different information systems within an organisation to ‘talk to’ each other is the first step. Some of the issues encountered during the project implementation include the lack of training and development, organisational reluctance to change, and the lack of a project plan. The challenges faced by the organisation during this project can be helpful for those implementing SSCs. Currently, many SSC frameworks and models exist, yet most seem to neglect localised challenges faced by the different stakeholders. This paper hopes to help bridge this gap in the SSC research agenda.
149

Anomaly Detection in Time Series Data Based on Holt-Winters Method / Anomalidetektering i tidsseriedata baserat på Holt-Winters metod

Aboode, Adam January 2018 (has links)
In today's world the amount of collected data increases every day, a trend which is likely to continue. At the same time, the potential value of the data also increases due to the constant development and improvement of hardware and software. However, in order to gain insights, make decisions or train accurate machine learning models, we want to ensure that the data we collect is of good quality. There are many definitions of data quality; in this thesis we focus on the accuracy aspect. One method which can be used to ensure accurate data is to monitor for and alert on anomalies. In this thesis we therefore suggest a method which, based on historic values, is able to detect anomalies in time series as new values arrive. The method consists of two parts: forecasting the next value in the time series using the Holt-Winters method and comparing the residual to an estimated Gaussian distribution. The suggested method is evaluated in two steps. First, we evaluate the forecast accuracy of the Holt-Winters method using different input sizes. In the second step we evaluate the performance of the anomaly detector when using different methods to estimate the variance of the distribution of the residuals. The results indicate that the suggested method works well most of the time for detection of point anomalies in seasonal and trending time series data. The thesis also discusses some potential next steps which are likely to further improve the performance of this method. / I dagens värld ökar mängden insamlade data för varje dag som går, detta är en trend som sannolikt kommer att fortsätta. Samtidigt ökar även det potentiella värdet av denna data tack vare ständig utveckling och förbättring utav både hårdvara och mjukvara. För att utnyttja de stora mängder insamlade data till att skapa insikter, ta beslut eller träna noggranna maskininlärningsmodeller vill vi försäkra oss om att vår data är av god kvalité. Det finns många definitioner utav datakvalité, i denna rapport fokuserar vi på noggrannhetsaspekten. En metod som kan användas för att säkerställa att data är av god kvalité är att övervaka inkommande data och larma när anomalier påträffas. Vi föreslår därför i denna rapport en metod som, baserat på historiska data, kan detektera anomalier i tidsserier när nya värden anländer. Den föreslagna metoden består utav två delar, dels att förutsäga nästa värde i tidsserien genom Holt-Winters metod samt att jämföra residualen med en estimerad normalfördelning. Vi utvärderar den föreslagna metoden i två steg. Först utvärderas noggrannheten av de, utav Holt-Winters metod, förutsagda punkterna för olika storlekar på indata. I det andra steget utvärderas prestandan av anomalidetektorn när olika metoder för att estimera variansen av residualernas distribution används. Resultaten indikerar att den föreslagna metoden i de flesta fall fungerar bra för detektering utav punktanomalier i tidsserier med en trend- och säsongskomponent. I rapporten diskuteras även möjliga åtgärder vilka sannolikt skulle förbättra prestandan hos den föreslagna metoden.
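As a hedged illustration of the two-part method the abstract describes (a sketch, not the thesis's implementation; the seasonal period, threshold k, and synthetic data are assumptions), using the Holt-Winters implementation from statsmodels: forecast one step ahead, then flag the new value if its residual lies outside k standard deviations of the in-sample residuals.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def is_anomaly(history, new_value, seasonal_periods=24, k=3.0):
    """Fit Holt-Winters (additive trend and seasonality) to the history,
    forecast one step ahead, and flag the new value as anomalous if the
    residual falls outside k standard deviations of the historical
    one-step residuals (i.e. outside an estimated Gaussian)."""
    fit = ExponentialSmoothing(
        np.asarray(history, dtype=float),
        trend="add",
        seasonal="add",
        seasonal_periods=seasonal_periods,
    ).fit()
    forecast = fit.forecast(1)[0]
    sigma = np.std(fit.resid)  # spread of in-sample forecast errors
    return abs(new_value - forecast) > k * sigma, forecast

# Illustrative usage: synthetic hourly data with a trend and daily seasonality.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
series = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.2, t.size)

t_next = 24 * 14
expected = 10 + 0.01 * t_next + 2 * np.sin(2 * np.pi * t_next / 24)
print(is_anomaly(series, expected + 5.0))  # spike far above the forecast: flagged
print(is_anomaly(series, expected))        # consistent with trend and season: should not be flagged
```

Different choices for estimating the residual variance (the second evaluation step in the abstract) simply replace the `np.std(fit.resid)` line.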
150

An Investigation into Improving the Repeatability of Steady-State Measurements from Nonlinear Systems. Methods for measuring repeatable data from steady-state engine tests were evaluated. A comprehensive and novel approach to acquiring high quality steady-state emissions data was developed

Dwyer, Thomas P. January 2014 (has links)
The calibration of modern internal combustion engines requires ever-improving measurement data quality so that the engines comply with increasingly stringent emissions legislation. This study establishes a methodology and a software tool to improve the quality of steady-state emissions measurements from engine dynamometer tests. The literature shows that state-of-the-art instrumentation is necessary to monitor the cycle-by-cycle variations that significantly alter emissions measurements. Test methodologies that consider emissions formation mechanisms invariably focus on thermal transients and preconditioning of internal surfaces. This work sought data quality improvements through three principal approaches: an adapted steady-state identifier to more reliably indicate when the test conditions reached steady state; engine preconditioning to reduce the influence of the prior day's operating conditions on the measurements; and test point ordering to reduce measurement deviation. An improved steady-state indicator was selected using correlations in test data. It was shown by repeating forty steady-state test points that a more robust steady-state indicator has the potential to reduce the measurement deviation of particulate number by 6%, unburned hydrocarbons by 24%, carbon monoxide by 10% and oxides of nitrogen by 29%. The variation of emissions measurements from those normally observed at a repeat baseline test point was significantly influenced by varying the preconditioning power. Preconditioning at the baseline operating condition converged emissions measurements with the mean of those typically observed. Changing the sequence of steady-state test points caused significant differences in the measured engine performance. Examining the causes of measurement deviation allowed an optimised test point sequencing method to be developed. A 30% reduction in the measurement deviation of a targeted engine response (particulate number emissions) was obtained using the developed test methodology. This was achieved by selecting an appropriate steady-state indicator and sequencing test points. The benefits of preconditioning were deemed short-lived and impractical to apply in everyday engine testing, although the principles were considered when developing the sequencing methodology.
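The abstract does not specify the indicator itself; as a hedged illustration of how a simple steady-state identifier might work (the window length and thresholds are assumptions, not the thesis's method), a sketch that declares a measurement channel steady when both its recent scatter and its recent drift are small:

```python
import numpy as np

def is_steady(signal, window=60, std_limit=0.5, slope_limit=0.01):
    """Simple steady-state identifier: the last `window` samples are
    considered steady when their standard deviation is small and the
    least-squares slope (drift per sample) is close to zero."""
    if len(signal) < window:
        return False
    recent = np.asarray(signal[-window:], dtype=float)
    slope = np.polyfit(np.arange(window), recent, 1)[0]
    return recent.std() < std_limit and abs(slope) < slope_limit

# Illustrative usage: a transient ramp followed by a settled plateau.
rng = np.random.default_rng(1)
ramp = np.linspace(0, 100, 120) + rng.normal(0, 0.2, 120)
plateau = 100 + rng.normal(0, 0.2, 120)
print(is_steady(ramp.tolist()))                              # False: still trending upwards
print(is_steady(np.concatenate([ramp, plateau]).tolist()))   # True: settled
```

In practice such a check would be applied jointly to several engine responses (emissions, temperatures, pressures), which is where the choice of indicator studied in the thesis matters.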
