11

Enriquecimento de dados: uma pré-etapa em relação à limpeza de dados / Data enrichment: a pre-step to data cleaning

Carreira, Juliano Augusto [UNESP] 12 July 2012 (has links) (PDF)
The incidence of duplicate tuples is a significant problem inherent in today's large databases. It is the repetition of records that, in most cases, are represented differently in the database but refer to the same real-world entity, which makes identifying duplicates hard work. The techniques designed to treat this kind of problem are usually generic, meaning they do not take into account the particular characteristics of each language, which inhibits the quantitative and qualitative maximization of the duplicate tuples identified. This dissertation proposes a pre-step, called enrichment, for the process of duplicate tuple identification. The process favors the language of the data and works through language rules predefined, in a generic way, for each desired language. The input records, defined in any language, are thus enriched, and with the orthographic approximation that the enrichment provides it is possible to increase the number of duplicate tuples found and/or improve the level of confidence in the pairs of duplicate tuples identified by the process.
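
As a rough illustration of the enrichment idea described above, the sketch below applies language-specific normalization rules to records before a generic string-similarity comparison. It is a minimal sketch under stated assumptions: the rule sets, function names, and similarity threshold are illustrative and not taken from the dissertation.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical per-language rules mapping spelling variants to a canonical form.
# Real rule sets would be far richer; these are illustrative only.
RULES = {
    "pt": [("ph", "f"), ("ss", "s"), ("ç", "c")],
    "en": [("colour", "color"), ("centre", "center")],
}

def enrich(record: str, lang: str) -> str:
    """Pre-step: normalize case and accents, then apply language-specific rewrite rules."""
    text = unicodedata.normalize("NFKD", record.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    for variant, canonical in RULES.get(lang, []):
        text = text.replace(variant, canonical)
    return text

def likely_duplicates(a: str, b: str, lang: str, threshold: float = 0.9) -> bool:
    """Compare enriched records with a generic similarity measure."""
    return SequenceMatcher(None, enrich(a, lang), enrich(b, lang)).ratio() >= threshold

# Two spellings of the same entity become identical after enrichment.
print(likely_duplicates("Jose Pharmacia Central", "José Farmácia Central", "pt"))  # True
```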
12

The Role of Duplicated Code in Software Readability and Comprehension

Liao, Xuan, Jiang, Linyao January 2020 (has links)
Background. Readability and comprehension are critical for software development and maintenance. Many researchers point out that duplicated code, as a code smell, affects software maintainability, but there is little research on how duplicated code affects software readability and comprehension, which are parts of maintainability. Objectives. In this thesis, we aim to briefly summarize the impact of duplicated code and the typical types of duplication according to current work; our goal is then to explore whether duplicated code is a factor that influences readability and comprehension. Methods. We conducted a background survey with forty-two subjects to help us classify them, and then ran an experiment with the subjects to explore the role of duplicated code in perceived readability and comprehension. Perceived readability and comprehension were measured by a perceived-readability scale, reading time, and the accuracy of a cloze test. Results. The experimental data show that code with duplication has higher perceived readability and better comprehension, although the differences are not significant. Code with duplication costs less reading time than code without duplication, and this difference is significant. Duplication type is strongly associated with perceived readability, and reading time is significantly associated with duplication type and code size. There is no significant correlation between the subjects' programming experience and perceived readability or comprehension, nor between perceived readability and comprehension, size, and cyclomatic complexity (CC) in our data. Conclusions. Code with duplication has higher readability according to the reading-time results, which are significant, and higher comprehension than code without duplication, although that difference is not statistically significant. Longer code increases reading time, and duplication type also influences perceived readability; the three duplication types we discussed show these relationships clearly.
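
As a sketch of the kind of group comparison such an experiment involves, the snippet below runs a non-parametric test on reading times for code with and without duplication. The data, variable names, and choice of the Mann-Whitney U test are assumptions for illustration, not the thesis's actual analysis.

```python
from scipy.stats import mannwhitneyu

# Hypothetical reading times (seconds) for snippets with and without duplicated code.
with_duplication = [48, 52, 45, 60, 55, 49, 51, 47]
without_duplication = [58, 63, 55, 70, 61, 66, 59, 64]

# Mann-Whitney U test: is the difference in reading time significant?
stat, p_value = mannwhitneyu(with_duplication, without_duplication, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant difference
```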
13

Enhancing TCP Congestion Control for Improved Performance in Wireless Networks

Francis, Breeson 13 September 2012 (has links)
Transmission Control Protocol (TCP), designed to deliver seamless and reliable end-to-end data transfer across unreliable networks, works very well in wired environments. TCP carries around 90% of Internet traffic, so the performance of the Internet is largely determined by the performance of TCP. However, end-to-end TCP throughput degrades notably in wireless networks. Due to high bit error rates and changing levels of congestion, retransmission timeouts for packets lost in transmission are unavoidable. TCP misinterprets these random packet losses, caused by the unpredictable nature of the wireless environment, and the subsequent packet reordering as congestion, and invokes congestion control by triggering fast retransmission and fast recovery, leading to underutilization of network resources and critically affecting TCP performance. This thesis reviews existing approaches and details two proposed schemes for better handling of networks with random loss and delay. The proposed schemes are evaluated with the OPNET simulator by comparing them against standard TCP variants over a varying number of hops.
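
A minimal sketch of the underlying idea follows: if the sender can tell random wireless loss apart from congestion loss, it can avoid needlessly shrinking its congestion window. The RTT-based heuristic, threshold, and function below are illustrative assumptions, not the schemes proposed in the thesis.

```python
def on_packet_loss(cwnd: float, ssthresh: float, rtt_sample: float, rtt_baseline: float):
    """Toy loss-differentiation heuristic for a TCP sender.

    If the measured RTT has not grown much, queues are probably short, so the loss
    is treated as random wireless corruption and the window is preserved; otherwise
    it is treated as congestion and handled as in standard fast recovery.
    """
    congestion_suspected = rtt_sample > 1.5 * rtt_baseline  # threshold is an assumption
    if congestion_suspected:
        ssthresh = max(cwnd / 2, 2)  # multiplicative decrease, as in standard TCP
        cwnd = ssthresh
    # else: random loss -> retransmit the lost segment but leave cwnd/ssthresh unchanged
    return cwnd, ssthresh

print(on_packet_loss(cwnd=40, ssthresh=64, rtt_sample=110, rtt_baseline=100))  # random loss: (40, 64)
print(on_packet_loss(cwnd=40, ssthresh=64, rtt_sample=200, rtt_baseline=100))  # congestion:  (20.0, 20.0)
```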
14

As duplicatas virtuais como forma de relativização ao princípio da cartularidade / Virtual duplicates as a form of relativization for the principle of cartularity

Micheli, Leonardo Miessa De 28 November 2014 (has links)
The goal of the present research is to analyze the virtual (or dematerialized) duplicate under the scientific focus of the fundamental principles of cartular law, especially the challenged cartularity principle. The system developed from the Duplicates Law in the 1960s, as well as the commercial and technological evolution intensified at the beginning of the new millennium, allowed and stimulated new forms of use and structuring of this innovative and elaborate credit title, which recurrently pushes the rediscussion and adaptation of the general theory of cartular law. Naturally, this evolution provokes resistance from the scientific community and from court decisions, which motivates the reanalysis, aimed at by this dissertation, of the century-old principles that act as the cornerstone of cartular law and from which the efficiency and security achieved by these commercial law instruments arise. The research pursues a logical-deductive analysis of the duplicate's evolution and its place in the general theory of credit titles, allowing, in the end, an empirical and case-law analysis of its inevitable and growing use in electronic environments and in its dematerialized form.
15

Probabilistic Simhash Matching

Sood, Sadhan 2011 August 1900 (has links)
Finding near-duplicate documents is an interesting problem, but existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memory. We evaluated our method on a dataset of 70M web documents and showed that it performs well. The results indicate that our method can reduce space usage by a factor of 5 while improving query time by a factor of 4, with a recall of 0.95, when finding all near-duplicates for an in-memory dataset. With the same recall and the same space reduction, query time improves by a factor of 4.5 when finding only the first near-duplicate for an in-memory dataset. When the dataset is stored on disk, performance improves by 7 times for finding all near-duplicates and by 14 times for finding the first near-duplicate.
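
For context, the snippet below shows the standard simhash construction and a Hamming-distance check for near-duplicates. This is the textbook technique only, not the probabilistic matching scheme developed in the thesis; the 64-bit fingerprint size and MD5 token hashing are common defaults assumed here.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint: near-identical texts get nearby fingerprints."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
# Near-duplicates typically differ in only a few of the 64 fingerprint bits.
print(hamming(simhash(doc1), simhash(doc2)))
```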
16

Designing Cards as a Polymorphic Resource for Online Free to Play Trading Card Games

Jonsson, Jerry, Tonegran, Lina January 2013 (has links)
Seasoned players of free-to-play trading card games, or players who invest large amounts of money in digital or physical trading card games, end up with superfluous cards that hold no value to them. The purpose of this thesis is to create designs that counter this problem. We analysed a selection of popular games on the market to get a better understanding of the depth of the problem and of existing designs and mechanics that counter it. With the knowledge gained from this research, we set out to design several systems that would give cards a polymorphic value. To validate those designs, we conducted qualitative interviews with highly experienced players of the genre. We discovered from our research and interviews that the problem of superfluous cards was larger than we had anticipated, and that few games had taken steps to counter it. The systems we designed gave cards a polymorphic value, and the designs proved successful in our validation. Our research and interviews suggest that implementing polymorphic attributes for cards could lessen or even remove the problem of superfluous cards while increasing sales of booster packs. / Jerry Jonsson studies game design and programming, and Lina Tonegran studies game design and graphics.
18

Cache-Oblivious Searching and Sorting in Multisets

Farzan, Arash January 2004 (has links)
We study three problems related to searching and sorting in multisets in the cache-oblivious model: finding the most frequent element (the mode), duplicate elimination, and multi-sorting. We are interested in minimizing the cache complexity (the number of cache misses) of algorithms for these problems when the cache size and block size are unknown. We start by showing lower bounds in the comparison model. Then we present lower bounds in the cache-aware model, which are also lower bounds in the cache-oblivious model. We consider an input multiset of size $N$ with multiplicities $N_1, \ldots, N_k$. The lower bound for the cache complexity of determining the mode is $\Omega\left(\frac{N}{B}\log_{M/B}\frac{N}{fB}\right)$, where $f$ is the frequency of the mode and $M$, $B$ are the cache size and block size respectively. The cache complexities of duplicate removal and multi-sorting have lower bounds of $\Omega\left(\frac{N}{B}\log_{M/B}\frac{N}{B} - \sum_{i=1}^{k}\frac{N_i}{B}\log_{M/B}\frac{N_i}{B}\right)$. We present two deterministic approaches to obtaining algorithms: selection and distribution. The algorithms based on these deterministic approaches differ from the lower bounds by at most an additive term of $\frac{N}{B}\log\log M$. Since $\log\log M$ is very small in real applications, the gap is tiny. Nevertheless, the ideas of our deterministic algorithms can be used to design cache-aware algorithms for these problems, and the resulting algorithms turn out to be simpler than the previously known cache-aware algorithms. Another approach is the probabilistic one: in contrast to the deterministic algorithms, our randomized cache-oblivious algorithms are all optimal and their cache complexities exactly match the lower bounds. All of our algorithms are within a constant factor of optimal in terms of the number of comparisons they perform.
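
For readability, the two lower bounds quoted in the abstract can be written as displayed formulas; the brief sanity check at the end is an added note, not part of the abstract.

```latex
% Lower bound for finding the mode, where f is the frequency of the mode:
\[
  \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{fB}\right)
\]
% Lower bound for duplicate elimination and multi-sorting:
\[
  \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}
    \;-\; \sum_{i=1}^{k}\frac{N_i}{B}\,\log_{M/B}\frac{N_i}{B}\right)
\]
% Sanity check: if every element is distinct (f = 1), the mode bound reduces to the
% usual external-memory sorting bound \Omega((N/B) \log_{M/B}(N/B)).
```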
20

Modeling and Querying Uncertainty in Data Cleaning

Beskales, George January 2012 (has links)
Data quality problems such as duplicate records, missing values, and violations of integrity constraints frequently appear in real-world applications. Such problems cost enterprises billions of dollars annually and might have unpredictable consequences in mission-critical tasks. Data cleaning refers to detecting and correcting errors in data in order to improve data quality. Numerous efforts have been made to improve the effectiveness and efficiency of data cleaning. A major challenge in the data cleaning process is the inherent uncertainty about the cleaning decisions that should be taken by the cleaning algorithms (e.g., deciding whether two records are duplicates or not). Existing data cleaning systems deal with this uncertainty by selecting one alternative, based on some heuristic, while discarding (i.e., destroying) all other alternatives, which results in a false sense of certainty. Furthermore, because of the complex dependencies among cleaning decisions, it is difficult to reverse the process of destroying some alternatives (e.g., when new external information becomes available). In most cases, restarting the data cleaning from scratch is inevitable whenever new evidence needs to be incorporated. To address the uncertainty in the data cleaning process, we propose a new approach, called probabilistic data cleaning, that views data cleaning as a random process whose possible outcomes are possible clean instances (i.e., repairs). Our approach generates multiple possible clean instances to avoid the destructive aspect of current cleaning systems. In this dissertation, we apply this approach in the context of two prominent data cleaning problems: duplicate elimination and repairing violations of functional dependencies (FDs). First, we propose a probabilistic cleaning approach for the problem of duplicate elimination. We define a space of possible repairs that can be generated efficiently. To achieve this goal, we concentrate on a family of duplicate detection approaches based on parameterized hierarchical clustering algorithms. We propose a novel probabilistic data model that compactly encodes the defined space of possible repairs, we show how to efficiently answer relational queries using the set of possible repairs, and we define new types of queries that reason about the uncertainty in the duplicate elimination process. Second, in the context of repairing violations of FDs, we propose a novel data cleaning approach that allows sampling from a space of possible repairs. We contrast the existing definitions of possible repairs and propose a new definition that can be sampled efficiently. We present an algorithm that randomly samples from this space, along with multiple optimizations that improve its performance. Third, we show how to apply our probabilistic data cleaning approach in scenarios where both the data and the FDs are unclean (e.g., due to data evolution or an inaccurate understanding of the data semantics). We propose a framework that simultaneously modifies the data and the FDs while satisfying multiple objectives, such as consistency of the resulting data with respect to the resulting FDs, (approximate) minimality of the changes to the data and FDs, and leveraging the trade-off between trusting the data and trusting the FDs. In the presence of uncertainty about the relative trust in the data versus the FDs, we show how to extend our cleaning algorithm to efficiently generate multiple possible repairs, each corresponding to a different level of relative trust.
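
As a rough illustration of the idea of a space of possible repairs for duplicate elimination, the sketch below cuts a single hierarchical clustering at several thresholds, each cut yielding one possible grouping of records into duplicate clusters. The data, threshold values, and linkage choice are assumptions for illustration; the thesis's probabilistic model, which assigns probabilities over such repairs and supports querying them, is far more involved.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical numeric record features; each row stands for one record.
records = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Build one hierarchical clustering; each cut threshold yields one possible repair.
tree = linkage(pdist(records), method="average")
possible_repairs = {
    tau: fcluster(tree, t=tau, criterion="distance")
    for tau in (0.2, 1.0, 6.0)  # parameter values are illustrative only
}
for tau, labels in possible_repairs.items():
    print(f"threshold {tau}: cluster labels {labels.tolist()}")
# Each labeling is one possible clean instance (repair); a probabilistic approach keeps
# them all, with probabilities, instead of committing to a single clustering.
```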
