31

Monitoramento de doadores de sangue através de integração de bases de texto heterogêneas / Monitoring of blood donors through the integration of heterogeneous text databases

Pinha, André Teixeira January 2016 (has links)
Advisor: Prof. Dr. Márcio Katsumi Oikawa / Master's dissertation - Universidade Federal do ABC, Programa de Pós-Graduação em Ciência da Computação, 2016. / Probabilistic record linkage between databases makes it possible to obtain information that individual or manual analysis would not provide. This work aims to find, through probabilistic record linkage, blood donors from the database of Fundação Pró-Sangue (FPS) in Brazil's Sistema de Informações sobre Mortalidade (SIM) for the years 2001 to 2006, thereby supporting the institution's management of blood products by inferring whether a given donor has died. To this end, the efficiency of different blocking keys was evaluated using a set of free record linkage tools and a program implemented specifically for this study, named SortedLink. The records were standardized, and only those with the mother's name on file were used. To assess the effectiveness of the blocking keys, 100,000 records were randomly selected from the SIM and FPS databases, and 30 validation records were added to each set. SortedLink, the software implemented in this work, produced the best results and was therefore used to generate the candidate record pairs over the full databases: 1,709,819 records from SIM and 334,077 from FPS. In addition, the study evaluates the efficiency of two phonetic encoding algorithms: SOUNDEX, typically used in record linkage, and BRSOUND, developed to encode given names and surnames from Brazilian Portuguese.
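
The abstract refers to blocking keys and to the SOUNDEX phonetic code without illustrating either. The following minimal Python sketch is not the dissertation's SortedLink software or its BRSOUND encoder, and field names such as name and birth_year are invented; it shows a standard SOUNDEX implementation and a blocking key that combines the phonetic code of the first name with the birth year, so that only records sharing a key are compared in detail.

    def soundex(name: str) -> str:
        """Classic SOUNDEX code: the first letter followed by three digits."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = "".join(c for c in name.lower() if c.isalpha())
        if not name:
            return ""
        first = name[0].upper()
        digits, prev = [], codes.get(name[0], "")
        for ch in name[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                digits.append(code)
            if ch not in "hw":            # 'h'/'w' do not break a run of equal codes
                prev = code
        return (first + "".join(digits) + "000")[:4]

    def blocking_key(record: dict) -> str:
        """Blocking key: phonetic code of the first name plus birth year."""
        first_name = record["name"].split()[0]
        return f"{soundex(first_name)}|{record['birth_year']}"

    donors = [{"name": "Maria da Silva", "birth_year": 1980},
              {"name": "Mariah Silva", "birth_year": 1980},
              {"name": "Pedro Alves", "birth_year": 1975}]
    blocks = {}
    for rec in donors:
        blocks.setdefault(blocking_key(rec), []).append(rec)
    print(blocks)     # the 'Maria'/'Mariah' records share the block 'M600|1980'

Grouping records by such a key keeps the expensive pairwise comparison quadratic only within each block rather than over the whole database.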
32

Avaliação experimental de uma técnica de padronização de escores de similaridade / Experimental evaluation of a similarity score standardization technique

Nunes, Marcos Freitas January 2009 (has links)
With the growth of the Web, the volume of information has grown considerably over the past years and, consequently, access to remote databases has become easier, which allows the integration of distributed information. Usually, instances of the same real-world object, originating from distinct databases, differ in the representation of their values; that is, the same information can be represented in different ways. In this context, research on approximate matching using similarity functions arises. As a consequence, there is a need to understand the results of these functions and to select suitable thresholds. Also, when matching records, there is the problem of combining the similarity scores, since distinct functions have different distributions. To overcome this problem, a previous work developed a technique that standardizes the scores by replacing the computed score with an adjusted score (computed through training), which is more intuitive for the user and can be combined in the record matching process. That technique was developed by a PhD student from the UFRGS database research group and is referred to here as MeaningScore (DORNELES et al., 2007). The present work studies and performs a detailed experimental evaluation of this technique. The evaluation shows that the MeaningScore approach is valid and returns better results: in record matching, where distinct similarity scores must be combined, using the adjusted score instead of the original score returned by the similarity function produces results of higher quality.
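
The claim that scores from distinct similarity functions "have different distributions" can be made concrete with a small sketch. The code below is not the MeaningScore technique of DORNELES et al. (2007); it only illustrates the general idea of learning, from labelled training pairs, a mapping from a raw score to a value with a uniform interpretation (here, the share of true matches among training pairs scoring at least as high). All function names and the toy training data are invented.

    from difflib import SequenceMatcher

    def seq_sim(a: str, b: str) -> float:
        """Character-based similarity in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def jaccard_sim(a: str, b: str) -> float:
        """Word-token Jaccard similarity in [0, 1]."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def calibrate(train, sim):
        """Learn a mapping from a raw score to the share of true matches among
        training pairs that score at least as high (precision at threshold)."""
        scored = [(sim(a, b), match) for a, b, match in train]
        def adjusted(score: float) -> float:
            above = [match for s, match in scored if s >= score]
            return sum(above) / len(above) if above else 1.0
        return adjusted

    train = [("Joao Silva", "João Silva", True),
             ("Jose de Souza", "José Souza", True),
             ("Ana Lima", "Carlos Pereira", False),
             ("Maria Santos", "Mariana Souza", False)]

    adj_seq, adj_jac = calibrate(train, seq_sim), calibrate(train, jaccard_sim)
    s1 = seq_sim("Joao da Silva", "João Silva")
    s2 = jaccard_sim("Joao da Silva", "João Silva")
    print(s1, s2)                      # raw scores live on different scales
    print(adj_seq(s1), adj_jac(s2))    # adjusted scores share one interpretation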
34

Ambiente independente de idioma para suporte a identificação de tuplas duplicadas por meio da similaridade fonética e numérica: otimização de algoritmo baseado em multithreading / Language-independent environment to support the identification of duplicate tuples through phonetic and numeric similarity: optimization of a multithreading-based algorithm

Andrade, Tiago Luís de. January 2011 (has links)
Abstract: In order to ensure greater reliability and consistency of the data stored in a database, the data cleaning stage sits at the beginning of the Knowledge Discovery in Databases (KDD) process. This step is significant because it eliminates problems that strongly affect the reliability of the extracted knowledge, such as missing values, null values, duplicate tuples, and out-of-domain values. It is an important step aimed at correcting and adjusting the data for the subsequent stages. Within this perspective, techniques that address these problems are presented. This work characterizes the detection of duplicate tuples in databases, presents the main algorithms based on distance metrics and some tools designed for this task, and develops a language-independent algorithm for identifying duplicate records based on phonetic and numeric similarity, implemented with multithreading to improve its execution time. Tests show that the proposed algorithm identified duplicate records better than existing phonetic algorithms, which ensures a better cleaning of the database. / Advisor: Carlos Roberto Valêncio / Co-advisor: Maurizio Babini / Committee member: Pedro Luiz Pizzigatti Corrêa / Committee member: José Márcio Machado / Master's degree
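
As an illustration of how the pairwise comparison work can be spread over worker threads, the sketch below uses Python's concurrent.futures; it is not the dissertation's algorithm, and difflib's SequenceMatcher merely stands in for the phonetic and numeric similarity it describes.

    from concurrent.futures import ThreadPoolExecutor
    from difflib import SequenceMatcher
    from itertools import combinations

    def compare(pair, threshold=0.80):
        """Return the pair and its score when the two strings look like duplicates.
        SequenceMatcher stands in for the thesis's phonetic + numeric similarity."""
        a, b = pair
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        return (a, b, round(score, 2)) if score >= threshold else None

    records = ["Joao da Silva", "João Silva", "Maria Souza", "Mario Sousa", "Ana Lima"]
    pairs = list(combinations(records, 2))

    # Threads mirror the thesis's multithreading; for CPU-bound pure-Python
    # comparisons, a ProcessPoolExecutor usually parallelises better under the GIL.
    with ThreadPoolExecutor(max_workers=4) as pool:
        duplicates = [hit for hit in pool.map(compare, pairs) if hit]

    print(duplicates)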
35

Realised stochastic volatility in practice / Model realizované stochastické volatility v praxi

Vavruška, Marek January 2012 (has links)
In this thesis, the Realised Stochastic Volatility model of Koopman and Scharth (2011) is applied to five stocks listed on the NYSE. The aim is to investigate the effect of speeding up trade data processing by skipping the cleaning rule that requires quote data. The Realised Stochastic Volatility framework allows the realised measures to be biased estimates of the integrated volatility, which further supports this approach. The number of errors in recorded trades has decreased significantly in recent years. Different sample lengths were used to construct one-day-ahead forecasts of the realised measures in order to examine how forecast precision depends on the rolling-window length. Using the longest window does not yield the lowest mean squared error. The dominance of the Realised Stochastic Volatility model, in terms of the lowest mean squared error of one-day-ahead out-of-sample forecasts, is confirmed.
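
The "realised measures" mentioned above are estimators built from intraday data; the simplest is the realised variance, the sum of squared intraday log-returns. The sketch below computes it for hypothetical five-minute prices and is unrelated to the Koopman and Scharth (2011) model itself.

    import numpy as np

    def realised_variance(prices: np.ndarray) -> float:
        """Daily realised variance: the sum of squared intraday log-returns."""
        log_returns = np.diff(np.log(prices))
        return float(np.sum(log_returns ** 2))

    # Hypothetical five-minute closing prices over part of one trading day
    prices = np.array([100.0, 100.2, 99.9, 100.4, 100.1, 100.6])
    rv = realised_variance(prices)
    print(rv, np.sqrt(rv))    # realised variance and the corresponding realised volatility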
36

Zpracování obchodních dat finančního trhu / Forex Data Processing

Olejník, Tomáš January 2011 (has links)
The objective of this master's thesis is to study the basics of high-frequency trading, especially trading on the foreign exchange market. The project deals with foreign exchange data preprocessing; the fundamentals of market data collection, storage, and cleaning are discussed. Making decisions based on poor-quality data can have serious consequences in finance, so data cleaning is essential. The thesis describes an adaptive data cleaning algorithm that adapts to current market conditions. Based on this design, a modular plug-in application for data collection, storage, and subsequent cleaning has been implemented.
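
The abstract does not specify the adaptive cleaning rule, so the sketch below shows one common family of adaptive tick filters: a tick is dropped when it deviates from a centred rolling median by more than k rolling median absolute deviations plus a small granularity term, so the rejection threshold tracks local market activity. The parameters window, k, and gamma are illustrative, not taken from the thesis.

    import pandas as pd

    def clean_ticks(prices: pd.Series, window: int = 5, k: float = 3.0,
                    gamma: float = 0.0001) -> pd.Series:
        """Adaptive outlier filter: drop a tick when it deviates from the centred
        rolling median by more than k rolling MADs plus a small granularity term."""
        med = prices.rolling(window, center=True, min_periods=1).median()
        dev = (prices - med).abs()
        mad = dev.rolling(window, center=True, min_periods=1).median()
        return prices[dev <= k * mad + gamma]

    ticks = pd.Series([1.1012, 1.1013, 1.1011, 1.9000, 1.1014, 1.1012])  # 1.9000 is a bad tick
    print(clean_ticks(ticks))   # the spurious tick is dropped, the rest are kept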
37

An Analysis of Data Cleaning Tools: A comparative analysis of the performance and effectiveness of data cleaning tools

Stenegren, Filip January 2023 (has links)
In a world teeming with data, faulty or inconsistent data is inevitable, and data cleansing, a process that purges such discrepancies, becomes crucial. The purpose of this study is to answer the question of which criteria data cleaning tools can be compared and evaluated with, and to carry out a comparative analysis of two data cleansing tools, one of which was developed for the purpose of this study while the other was provided for it. The result of the analysis should answer the question of which of the tools is superior and in what regard. The resulting comparison criteria are execution time, RAM (Random Access Memory) and CPU (Central Processing Unit) usage, scalability, and user experience. Through systematic testing and evaluation, the developed tool outperformed on efficiency criteria such as execution time and scalability, and it also has a slight edge in resource consumption. However, because the provided tool offers a GUI (Graphical User Interface), there is no definitive answer as to which tool is superior, since user experience and needs can outweigh technical prowess. Thus, the conclusion as to which tool is superior may vary depending on the specific needs of the user.
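
Execution time and memory use, two of the criteria listed above, can be measured in a few lines of Python. The harness below is only a sketch (the abstract does not describe the thesis's actual measurement setup, and tracemalloc tracks Python-level allocations rather than full process RSS or CPU load); drop_duplicates is a toy cleaning step invented to exercise it.

    import time
    import tracemalloc

    def benchmark(clean_fn, data):
        """Measure wall-clock time and peak Python memory use of a cleaning function."""
        tracemalloc.start()
        start = time.perf_counter()
        result = clean_fn(data)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return result, elapsed, peak / 1_000_000     # seconds and megabytes

    def drop_duplicates(rows):
        """Toy cleaning step used only to exercise the harness."""
        return list(dict.fromkeys(rows))

    rows = ["a", "b", "a", "c"] * 250_000
    _, seconds, peak_mb = benchmark(drop_duplicates, rows)
    print(f"{seconds:.3f} s, peak {peak_mb:.1f} MB")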
38

The Effectiveness of Warnings at Reducing the Prevalence of Insufficient Effort Responding

Blackmore, Caitlin E. 19 December 2014 (has links)
No description available.
39

Creation of a Time-Series Data Cleaning Toolbox

Kovács, Márton January 2024 (has links)
A significant drawback of currently used data cleaning methods is their reliance on domain knowledge or a background in data science, and with the vast number of possible solutions to this problem, the data cleaning step may be skipped entirely when developing a machine learning (ML) model. Since skipping this stage results in lower performance for ML models, a general-purpose time-series data cleaning user interface (UI) was developed in Python [1], targeting users unfamiliar with data cleaning. Following development, the UI was tested on time-series datasets available in online repositories, and a comparison of estimation performance between ML models trained on the original datasets and on datasets cleaned through the UI was carried out. This comparison showed that using the UI can significantly improve the performance of ML models; however, the degree of improvement is highly dataset dependent.
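
The abstract does not list the cleaning operations the toolbox offers, so the sketch below shows two steps such a UI might wrap for a time series: masking outliers with a robust median/MAD rule and interpolating the resulting gaps along the time index. Thresholds and data are invented.

    import numpy as np
    import pandas as pd

    def clean_series(s: pd.Series, z: float = 3.0) -> pd.Series:
        """Mask values lying more than z robust standard deviations from the
        median (median/MAD rule), then interpolate the gaps along the time index."""
        mad = (s - s.median()).abs().median()
        robust_std = 1.4826 * mad if mad > 0 else s.std()
        masked = s.mask((s - s.median()).abs() > z * robust_std)
        return masked.interpolate(method="time").ffill().bfill()

    idx = pd.date_range("2024-01-01", periods=8, freq="h")
    raw = pd.Series([1.0, 1.1, np.nan, 1.2, 9.5, 1.3, 1.2, 1.1], index=idx)
    print(clean_series(raw))   # the gap is interpolated and 9.5 is masked as an outlier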
40

Řízení kvality dat v malých a středních firmách / Data quality management in small and medium enterprises

Zelený, Pavel January 2010 (has links)
This diploma thesis deals with data quality management. There are many tools and methodologies that support data quality management, even on the Czech market, but they all target large companies; small and medium-sized companies cannot afford them because of the high cost. The first goal of this thesis is to summarize the principles of these methodologies and then, based on them, to propose a simpler methodology for small and medium-sized companies. In the second part of the thesis, the methodology is adapted and applied to a specific company. The first step is to choose the data area of interest in the company. Because buying a software tool to clean the data was not feasible, relatively simple rules are defined and serve as the basis for cleaning scripts written in SQL, which are used for automatic data cleaning. A further analysis decides which data should be cleaned manually. The next step describes recommendations for removing duplicates from the database, using functionality of the company's production system. The last step of the methodology is to create a control mechanism that maintains the required data quality in the future. At the end of the thesis, data from four sources are examined, all from companies using the same production system, in order to present an overview of data quality and to support decisions about data cleaning in those companies.
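
Below is a minimal, self-contained example of the kind of rule-based SQL cleaning described above, run here against an in-memory SQLite database rather than the company's production system; the customer table, its columns, and the three rules are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        INSERT INTO customer (name, email) VALUES
            ('  Jan Novak ', 'jan.novak@example.com'),
            ('Jan Novak',    'JAN.NOVAK@EXAMPLE.COM'),
            ('Eva Dvorak',   'not-an-email');
    """)

    # Rule 1: normalise whitespace and e-mail case.
    conn.execute("UPDATE customer SET name = TRIM(name), email = LOWER(email);")
    # Rule 2: null out out-of-domain values instead of silently dropping rows.
    conn.execute("UPDATE customer SET email = NULL WHERE email NOT LIKE '%_@_%._%';")
    # Rule 3: report duplicates (same name and e-mail) for manual review.
    dupes = conn.execute("""
        SELECT name, email, COUNT(*) FROM customer
        GROUP BY name, email HAVING COUNT(*) > 1;
    """).fetchall()
    print(dupes)   # [('Jan Novak', 'jan.novak@example.com', 2)]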
