151 |
Quality data extraction methodology based on the labeling of coffee leaves with nutritional deficiencies Jungbluth, Adolfo, Yeng, Jon Li 04 1900 (has links)
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher where it was published. / Detecting nutritional deficiencies in coffee leaves is a task that is often undertaken manually by experts in the field known as agronomists. The process they follow to carry out this task is based on observing the different characteristics of the coffee leaves while relying on their own experience. Visual fatigue and human error in this empiric approach cause leaves to be incorrectly labeled, thus affecting the quality of the data obtained. In this context, different crowdsourcing approaches can be applied to enhance the quality of the extracted data. These approaches separately propose the use of voting systems, association rule filters and evolutive learning. In this paper, we extend the use of association rule filters and the evolutive approach by combining them in a methodology that enhances data quality while guiding users during the main stages of data extraction tasks. Moreover, our methodology proposes a reward component to engage users and keep them motivated during the crowdsourcing tasks. The dataset extracted by applying our proposed methodology in a case study on Peruvian coffee leaves reached 93.33% accuracy, with 30 instances collected by 8 experts and evaluated by 2 agronomic engineers with a background in coffee leaves. This accuracy was higher than that obtained by independently implementing the evolutive feedback strategy and an empiric approach, which resulted in 86.67% and 70% accuracy respectively under the same conditions. / Peer reviewed
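As a rough illustration of the voting-based crowdsourcing approaches mentioned in this abstract (not the authors' implementation), the following sketch aggregates contributor labels per leaf by majority vote and flags ties for expert review; the deficiency labels and leaf identifiers are hypothetical.

```python
from collections import Counter

def aggregate_labels(votes):
    """Majority-vote aggregation for crowdsourced leaf labels.

    `votes` maps a leaf id to the list of deficiency labels proposed by
    contributors; ties are flagged so an agronomist can arbitrate.
    """
    consensus = {}
    for leaf_id, labels in votes.items():
        counts = Counter(labels).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            consensus[leaf_id] = "NEEDS_EXPERT_REVIEW"
        else:
            consensus[leaf_id] = counts[0][0]
    return consensus

votes = {
    "leaf_01": ["nitrogen", "nitrogen", "potassium"],
    "leaf_02": ["boron", "iron", "boron", "iron"],   # tie -> routed to an expert
}
print(aggregate_labels(votes))
```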
|
152 |
Možnosti zpracování a využití otevřených dat / Utilization of Open Data Ferdan, Ondřej January 2016 (has links)
The main goal of this diploma thesis is to characterize open data and its standards and to analyze the adoption and utilization of open principles in the public sector of the Czech Republic, with a comparison to the European Union and selected countries. It identifies technologies and tools for linked data, which are required to reach the highest rating of data openness. It defines geographical data, its standards, and the INSPIRE directive for spatial information in Europe. The goal of the practical part of the thesis is to analyze the adoption of open principles for geographical data among Czech institutions, focusing on which data are available, whether open principles are applied, and under what conditions the data are made available. Foreign countries are also covered for comparison.
|
153 |
Distribuindo dados e consultas em um ambiente de data warehousing na web / Distributing data and queries in a web data warehousing environment PALILOT, Álvaro Alencar Barbosa 31 January 2010 (has links)
Nowadays, one of the most widely used Business Intelligence (BI) tools for supporting decision making by the senior management of large companies is the Data Warehouse (DW). The DW is a database that stores its data in a special way so that business-oriented queries are optimized; in addition, the data are non-volatile, historical and integrated. The environment in which the DW operates is the data warehousing environment, which comprises not only the DW but also other components that help it fulfill its purpose.
The growing number of users of this environment, the exponential growth in the size of the DW, and the need to optimize queries and to serve locally the interests of the management of specific departments or branches have led database researchers to seek solutions for distributing data and queries transparently and securely in a data warehousing environment. There are currently several related works in this line of research, but none demonstrates in practice the effective result of an architecture that delivers these advantages.
This work builds on the architecture of the WebD²W (Web Distributed Data Warehousing) system proposed by Cristina Ciferri to realize this distribution. Accordingly, two components were developed: the distribution component, which uses the concept of derivation graphs to develop horizontal and mixed fragmentation algorithms, and the query component of the distributed environment, which extends the Mondrian OLAP server to meet the requirements imposed by this new architecture. Finally, a DW for a chain of DVD rental stores was generated and used as a case study to show the applicability and efficiency of these components.
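As a rough illustration of the horizontal fragmentation idea (the thesis's actual algorithms are based on derivation graphs within the WebD²W architecture, which this sketch does not reproduce), the following partitions fact rows by a branch attribute so that each site stores only its local slice; all names and values are hypothetical.

```python
from collections import defaultdict

def horizontal_fragment(fact_rows, key="branch"):
    """Split a fact table into horizontal fragments, one per branch/site.

    Each fragment keeps complete rows, so a local OLAP server can answer
    branch-level queries without touching the other fragments.
    """
    fragments = defaultdict(list)
    for row in fact_rows:
        fragments[row[key]].append(row)
    return dict(fragments)

rentals = [
    {"branch": "recife", "title": "Movie A", "revenue": 12.0},
    {"branch": "olinda", "title": "Movie B", "revenue": 8.5},
    {"branch": "recife", "title": "Movie C", "revenue": 5.0},
]
print(horizontal_fragment(rentals)["recife"])   # only the Recife fragment
```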
|
154 |
Utvärdering av riktlinjer för inkorporering av syndikat data i datalager : Praktikfältets syn på tillämpbarhet och nyttoeffekt av Strands riktlinjer för inkorporering av syndikat data i datalager / Evaluation of guidelines for incorporating syndicate data into data warehouses : practitioners' views on the applicability and benefit of Strand's guidelines for incorporating syndicate data into data warehouses. Helander, Magnus January 2005 (has links)
Incorporating external data into data warehouses is problematic, and this is confirmed by recent studies in the field. As a result, various forms of support have been developed to address and analyze the problems that organizations face. For organizations it is highly important that their decision makers are well informed and able to select information from large amounts of data. It is in this context that a data warehouse solution is an important cornerstone for supporting the analysis and presentation of data that is originally stored in different data sources (both internal and external). By incorporating external data into the data warehouse, the warehouse reaches a considerably higher potential, and organizations, and above all their decision makers, can thereby gain substantial advantages. Strand (2005) has developed guidelines to support the process of incorporating external data into data warehouses. However, an evaluation of the guidelines is lacking. An evaluation helps to strengthen the credibility of the guidelines and to bring them into a maintenance process at an early stage.
|
155 |
Webové aplikace s využitím Linked Open Data / Web application using the Linked Open Data Le Xuan, Dung January 2014 (has links)
This thesis deals with the issue of open data. The aim is to introduce the reader to this currently very popular topic. Linking these data together offers additional advantages and opportunities; however, a large number of open data datasets are published in formats that cannot be linked together. The author therefore places great emphasis on Linked Data. Emphasis is placed not only on its emergence, current status and future development, but also on the technical aspects. First, readers are introduced to the theoretical concepts and principles of Linked Open Data and to the expansion of open government data in the Czech Republic and abroad. The next chapter covers data formats such as RDF, the SPARQL language and related technologies. In the last section, the author introduces tools for working with Linked Open Data and designs a sample application that uses Linked Open Data. The contribution of the work is a comprehensive view of Linked Open Data from both a theoretical and a practical perspective. The main goal is to provide readers with a quality introduction to the topic.
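As a small, self-contained illustration of the Linked Open Data building blocks discussed in the thesis, RDF triples queried with SPARQL, the sketch below uses the rdflib Python library; the vocabulary and data are invented for the example and are not taken from the thesis.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

g = Graph()
EX = Namespace("http://example.org/")          # hypothetical namespace for the demo

# Two linked resources expressed as RDF triples.
g.add((EX.prague, RDF.type, EX.City))
g.add((EX.prague, FOAF.name, Literal("Prague")))
g.add((EX.dataset1, EX.coversCity, EX.prague))

# A SPARQL query over the in-memory graph, analogous to querying a LOD endpoint.
results = g.query("""
    PREFIX ex: <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
        ?d ex:coversCity ?city .
        ?city foaf:name ?name .
    }
""")
for (name,) in results:
    print(name)   # -> Prague
```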
|
156 |
The role academic libraries could play in developing research data management services : a case of Makerere University Library Ssebulime, Joseph 08 November 2017 (has links)
Research data management (RDM) focuses on the organization and description of data, from its entry into the research cycle through to the dissemination and archiving of valuable results. RDM entails storage, security, preservation, compliance, quality, sharing and jurisdiction. In the academic world, RDM can support the research process by searching for relevant data, storing data, describing data and advising researchers on good RDM practice.
This study focused on developing RDM services. The aim of the study was to establish the role Makerere University Library could play in developing RDM services. A number of sub-questions were formulated to guide the researcher in answering the main research question.
A literature review, based on the research sub-questions, was carried out. The review covered the concept of RDM, academic libraries and their RDM practices, various RDM services in academic libraries, RDM services that require sustainability, and how current researchers, in general, manage their data.
The research took a qualitative approach with a case study design, owing to the need to gather in-depth and comprehensive views and experiences regarding RDM at Makerere University. A purposive sampling technique was used to identify researchers who are actively involved in managing research data at Makerere University. Data were collected using semi-structured interviews with eight participants, one from each college. The participants were selected because of their knowledge about RDM, and semi-structured interviews were preferred for their flexibility. An interview schedule was used as the data collection instrument. Data were transcribed into Microsoft Word for easy analysis.
Findings that addressed the research question and sub-questions were presented and interpreted in chapter four, and conclusions as well as recommendations were discussed in detail in chapter five of this research report. In summary, although researchers from across the entire university generate large volumes of research data, it appears that the researchers themselves manage, control and store their data on various removable devices, which is risky. There is therefore a need to develop RDM skills for all stakeholders. It does appear, though, that researchers at Makerere University would welcome RDM services if these were developed by the library. / Mini Dissertation (MIT)--University of Pretoria, 2017. / Carnegie Corporation of New York / Information Science / MIT / Unrestricted
|
157 |
A New Method and Python Toolkit for General Access to Spatiotemporal N-Dimensional Raster Data Hales, Riley Chad 29 March 2021 (has links)
Scientific datasets from global-scale scientific models and remote sensing instruments are becoming available at greater spatial and temporal resolutions with shorter lag times. These data are frequently gridded measurements spanning two or three spatial dimensions, the time dimension, and often several data dimensions that vary by the specific dataset. These data are useful in many modeling and analysis applications across the geosciences. Unlike vector spatial datasets, raster spatial datasets lack widely adopted conventions in file formats, data organization, and dissemination mechanisms. Raster datasets are often saved using the Network Common Data Format (NetCDF), Gridded Binary (GRIB), Hierarchical Data Format (HDF), or Geographic Tagged Image File Format (GeoTIFF) file formats. Several of these are entirely or partially incompatible with common GIS software, which introduces additional complexity in extracting values from these datasets. We present a method and companion Python package as a general-purpose tool for extracting time series subsets from these files using various spatial geometries. This method and tool enable efficient access to multidimensional data regardless of the format of the data. This research builds on existing file formats and software rather than suggesting new alternatives. We also present an analysis of optimizations and performance.
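The kind of extraction this abstract describes can be approximated, for NetCDF inputs at least, with the widely used xarray library; the sketch below is a generic illustration with hypothetical file, variable and coordinate names, not the toolkit presented in the thesis.

```python
import xarray as xr

# Open a (hypothetical) gridded NetCDF file with dimensions (time, lat, lon).
ds = xr.open_dataset("precipitation.nc")

# Point extraction: time series at the grid cell nearest a coordinate pair.
point_series = ds["precip"].sel(lat=40.25, lon=-111.65, method="nearest")

# Bounding-box extraction (assuming ascending coordinates): spatial mean per time step.
box = ds["precip"].sel(lat=slice(35, 45), lon=slice(-115, -105))
box_series = box.mean(dim=["lat", "lon"])

print(point_series.to_series().head())
print(box_series.to_series().head())
```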
|
158 |
The role of academic libraries in implementing research data services: a case study of the University of KwaZulu-Natal Libraries Madibi, Zizipho 22 February 2022 (has links)
This study investigated the role of academic libraries in implementing research data services, with UKZN as the case study. The objectives of the study were to identify the need for research data services among UKZN researchers, to identify the major challenges associated with introducing research data services at UKZN, and to determine the possibility of implementing research data services at UKZN Libraries. The Data Curation Centre Lifecycle model was adopted as the framework for the study because it connects the different stages of research data management. The study took a mixed methods approach in which interviews and a survey were used. A purposive sample was used to select library staff, and a random sample was drawn from 1341 UKZN academics. From this population of 1341, 299 was the minimum sample size recommended by the Raosoft sample size calculator for a 5% margin of error and a 95% confidence level. For the quantitative component, an online questionnaire was administered using Google Forms. A series of questions was formulated to obtain answers to the study objectives. Google Forms was used for the analysis, while figures and tables were created using Microsoft Excel. Interviews with the library staff were recorded and transcribed into Microsoft Word. The study revealed that UKZN Libraries are still struggling with RDM policy development. The findings revealed that the researchers who responded showed a lack of RDM awareness, while library staff showed a moderate level of awareness. The study revealed that researchers at UKZN work with different types of data and use different storage options such as removable storage devices, computer hard drives and cloud services. Although a few researchers have developed data management plans at UKZN, they have not done so because they were mandated by the institution: UKZN has not yet developed DMPs, and library staff are not aware which funders require them. The researchers who responded showed interest in different kinds of training, such as training on data storage, the development of DMPs and metadata creation. The library staff were more eager to provide data storage, data archiving and sharing, mainly because of the existence of the UKZN data repository (Yabelana). The study recommendations are based on the analysed data; one of them is that UKZN Libraries should assume the role of advisor and trainer for research data services at UKZN.
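As a quick arithmetic check (not part of the study itself), the standard sample-size formula with finite-population correction, which online calculators such as Raosoft typically implement, reproduces the reported minimum of 299 for a population of 1341 at a 5% margin of error and 95% confidence, assuming maximum variability (p = 0.5).

```python
import math

def sample_size(population, margin=0.05, confidence_z=1.96, p=0.5):
    """Minimum sample size with finite-population correction.

    Assumes maximum variability (p = 0.5), the usual conservative default
    in online sample-size calculators.
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                   # finite-population correction
    return math.ceil(n)

print(sample_size(1341))  # -> 299, matching the figure reported above
```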
|
159 |
Preprocessing unbounded data for use in real time visualization : Building a visualization data cube of unbounded data Hallman, Isabelle January 2019 (has links)
This thesis evaluates the viability of a data cube as a basis for visualization of unbounded data. A cube designed for use with visualization of static data was adapted to allow for point-by-point insertions. The new cube was evaluated by measuring the time it took to insert different numbers of data points. The results indicate that the cube can keep up with data streams with a velocity of up to approximately 100 000 data points per second. The conclusion is that the cube is useful if the velocity of the data stream is within this bound, and if the granularity of the represented dimensions is sufficiently low.
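The point-by-point insertion evaluated in the thesis can be pictured with a much-simplified count-cube sketch (an illustrative toy, not the adapted cube described above): each incoming point is binned along fixed-width dimensions and a counter is incremented, so the structure's size depends on granularity rather than on how many points the stream has delivered.

```python
from collections import defaultdict

class StreamingCountCube:
    """Toy data cube that aggregates an unbounded stream point by point.

    Each dimension is discretized into fixed-width bins; the cube stores a
    count per bin combination, so its memory footprint depends on granularity
    rather than on the number of points inserted.
    """
    def __init__(self, bin_widths):
        self.bin_widths = bin_widths              # e.g. {"x": 1.0, "t": 60.0}
        self.cells = defaultdict(int)

    def insert(self, point):
        key = tuple(int(point[d] // w) for d, w in self.bin_widths.items())
        self.cells[key] += 1

    def slice_count(self, dim_index, bin_value):
        # Aggregate over all cells whose bin along one dimension matches.
        return sum(c for k, c in self.cells.items() if k[dim_index] == bin_value)

cube = StreamingCountCube({"x": 1.0, "t": 60.0})
for i in range(1000):                              # stand-in for an unbounded stream
    cube.insert({"x": (i % 7) * 0.5, "t": float(i)})
print(cube.slice_count(0, 0))                      # points whose x falls in the first bin
```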
|
160 |
Data Cleaning with Minimal Information Disclosure Gairola, Dhruv 11 1900 (has links)
Businesses analyze large datasets in order to extract valuable insights from the data. Unfortunately, most real datasets contain errors that need to be corrected before any analysis. Businesses can utilize various data cleaning systems and algorithms to automate the correction of data errors. Many systems correct the data errors by using information present within the dirty dataset itself. Some also incorporate user feedback in order to validate the quality of the suggested data corrections. However, users are not always available for feedback. Hence, some systems rely on clean data sources to help with the data cleaning process. This involves comparing records between the dirty dataset and the clean dataset in order to detect high quality fixes for the erroneous data. Every record in the dirty dataset is compared with every record in the clean dataset in order to find similar records. The values of the records in the clean dataset can be used to correct the values of the erroneous records in the dirty dataset. Realistically, comparing records across two datasets may not be possible due to privacy reasons. For example, there are laws to restrict the free movement of personal data. Additionally, different records within a dataset may have different privacy requirements. Existing data cleaning systems do not factor in these privacy requirements on the respective datasets. This motivates the need for privacy aware data cleaning systems. In this thesis, we examine the role of privacy in the data cleaning process. We present a novel data cleaning framework that supports the cooperation between the clean and the dirty datasets such that the clean dataset discloses a minimal amount of information and the dirty dataset uses this information to (maximally) clean its data. We investigate the tradeoff between information disclosure and data cleaning utility, modelling this tradeoff as a multi-objective optimization problem within our framework. We propose four optimization functions to solve our optimization problem. Finally, we perform extensive experiments on datasets containing up to 3 million records by varying parameters such as the error rate of the dataset, the size of the dataset, the number of constraints on the dataset, etc and measure the impact on accuracy and performance for those parameters. Our results demonstrate that disclosing a larger amount of information within the clean dataset helps in cleaning the dirty dataset to a larger extent. We find that with 80% information disclosure (relative to the weighted optimization function), we are able to achieve a precision of 91% and a recall of 85%. We also compare our algorithms against each other to discover which ones produce better data repairs and which ones take longer to find repairs. We incorporate ideas from Barone et al. into our framework and show that our approach is 30% faster, but 7% worse for precision. We conclude that our data cleaning framework can be applied to real-world scenarios where controlling the amount of information disclosed is important. / Thesis / Master of Computer Science (MCS)
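The disclosure/utility tradeoff described above lends itself to a simple weighted scalarization. The sketch below is illustrative only; it does not reproduce any of the four optimization functions proposed in the thesis, and all numbers in it are made up.

```python
def weighted_objective(utility, disclosure, alpha=0.8):
    """Scalarized tradeoff: higher is better.

    utility    -- fraction of dirty records the disclosed information can repair
    disclosure -- fraction of the clean dataset's information that is revealed
    alpha      -- hypothetical weight on cleaning utility versus privacy
    """
    return alpha * utility - (1 - alpha) * disclosure

# Evaluate a few candidate disclosure levels (illustrative numbers only).
candidates = [
    {"disclosure": 0.2, "utility": 0.55},
    {"disclosure": 0.5, "utility": 0.78},
    {"disclosure": 0.8, "utility": 0.91},
]
best = max(candidates, key=lambda c: weighted_objective(c["utility"], c["disclosure"]))
print(best)   # the candidate with the best utility-versus-disclosure score
```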
|