51

User-generated content text normalization

Thales Felipe Costa Bertaglia 18 August 2017
User-Generated Content (UGC) is the name given to content created spontaneously by ordinary individuals, without ties to the media. This type of content carries valuable information and can be exploited by several areas of knowledge. Much UGC is available as text: product reviews, comments on movie forums, and discussions on social networks are examples. However, the language used in UGC texts diverges in many ways from the standard norm of the language, making such texts difficult for NLP techniques to process. UGC language is strongly tied to everyday usage and therefore contains a large amount of noise. Spelling mistakes, abbreviations, slang, and the absence or misuse of punctuation and capitalization are some of the noise types that hinder processing. Several works report a considerable loss of performance when state-of-the-art NLP tools are applied to UGC texts. Textual normalization is the process of transforming noisy words into words considered correct, and it can be used to improve the quality of UGC texts. This work reports the development of methods and systems that aim to (a) identify noisy words in UGC texts, (b) find candidate words for their substitution, and (c) rank the candidates to perform normalization. For noisy-word identification, methods based on lexicons and on machine learning with deep neural networks were proposed. Automatic identification produced results comparable to lexicon lookup, showing that this step can be performed with low dependence on linguistic resources. For candidate generation and ranking, techniques based on lexical similarity and word embeddings were investigated; word embeddings proved highly suitable for normalization, achieving the best results.
All proposed methods were evaluated on a UGC corpus annotated during the project, containing texts from different sources: discussion forums, product reviews, and Twitter posts. A system named Enelvo, combining all the methods, was implemented and compared to an existing normalizer, UGCNormal. Enelvo performed considerably better, with correction rates between 67% and 97% for different noise types, lower resource dependence, and greater flexibility in normalization.
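As a toy illustration of the three steps above (noisy-word identification, candidate generation, and ranking), the sketch below uses a hand-made lexicon and two-dimensional embeddings. All words, vectors, and scores are invented for illustration; they are not Enelvo's actual resources or architecture.

```python
from math import sqrt

# Illustrative stand-ins for the lexicon and word embeddings; a real system
# would load a full dictionary and embeddings trained on UGC text.
LEXICON = {"you", "are", "great", "the", "movie"}
EMBEDDINGS = {
    "you": [0.9, 0.1],
    "are": [0.2, 0.8],
    "u":   [0.85, 0.15],   # noisy form close to "you"
    "r":   [0.25, 0.75],   # noisy form close to "are"
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def is_noisy(word):
    # Step (a): a lexicon lookup flags out-of-vocabulary tokens as noisy.
    return word not in LEXICON

def normalize(word):
    # Steps (b) and (c): generate in-lexicon candidates and rank them by
    # embedding similarity to the noisy form, keeping the best match.
    if not is_noisy(word) or word not in EMBEDDINGS:
        return word  # in-vocabulary, or no embedding to rank with
    candidates = [c for c in LEXICON if c in EMBEDDINGS]
    return max(candidates, key=lambda c: cosine(EMBEDDINGS[word], EMBEDDINGS[c]))

print([normalize(w) for w in "u r great".split()])  # -> ['you', 'are', 'great']
```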
52

Parallel implementation of curve reconstruction from noisy samples

Randrianarivony, Maharavo, Brunnett, Guido 06 April 2006
This paper is concerned with approximating noisy samples by non-uniform rational B-spline curves, with special emphasis on free knots. We show how to set up the problem so that nonlinear optimization methods can be applied efficiently. This involves introducing penalty terms to avoid undesired knot positions. We report on our implementation of the nonlinear optimization and show how to parallelize the program. Our experiments show that the program achieves linear speedup and an efficiency value close to unity. Runtime results on a parallel computer are given.
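The underlying fitting problem, approximating noisy planar samples with a smoothing B-spline curve, can be sketched with SciPy's standard routines. Note that SciPy places knots automatically; it does not perform the paper's penalized free-knot nonlinear optimization, so this only illustrates the approximation step on synthetic data.

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Noisy samples of a unit circle (synthetic; noise level is illustrative).
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
x = np.cos(t) + rng.normal(0, 0.02, t.size)
y = np.sin(t) + rng.normal(0, 0.02, t.size)

# Fit a smoothing B-spline curve; s bounds the sum of squared residuals,
# trading closeness of fit against the number of knots chosen.
tck, u = splprep([x, y], s=0.1)
xf, yf = splev(u, tck)

rms = float(np.sqrt(np.mean((xf - x) ** 2 + (yf - y) ** 2)))
print(f"RMS residual: {rms:.4f}")
```

The residual stays near the injected noise level, well below the sample spread, which is the behaviour a smoothing fit should show.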
53

Parallel implementation of surface reconstruction from noisy samples

Randrianarivony, Maharavo, Brunnett, Guido 06 April 2006
We consider the problem of reconstructing a surface from noisy samples by approximating the point set with non-uniform rational B-spline surfaces. We focus on the fact that the knot sequences should also be part of the unknown variables, together with the control points and the weights, in order to find their optimal positions. We show how to set up the free-knot problem so that constrained nonlinear optimization can be applied efficiently. We describe in detail a parallel implementation of our approach that gives almost linear speedup. Finally, we provide numerical results obtained on the Chemnitzer Linux Cluster supercomputer.
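The surface analogue of the fitting step can likewise be sketched with SciPy's smoothing bivariate spline. As before, the knots are chosen by the routine rather than optimized as free variables, and the saddle-shaped test data are synthetic.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

# Noisy scattered samples of a saddle surface z = x^2 - y^2 (synthetic data).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = rng.uniform(-1, 1, 400)
z = x ** 2 - y ** 2 + rng.normal(0, 0.01, x.size)

# Smoothing spline surface fit; s bounds the sum of squared residuals.
spl = SmoothBivariateSpline(x, y, z, s=0.05)

# Evaluate the fitted surface at one point and compare to the true value 0.
err = abs(float(spl(0.5, 0.5)[0, 0]) - (0.5 ** 2 - 0.5 ** 2))
print(f"error at (0.5, 0.5): {err:.4f}")
```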
54

A Rule-Based Normalization System for Greek Noisy User-Generated Text

Toska, Marsida January 2020
The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Mining user-generated data for commercial, academic, or other purposes has therefore already attracted the interest of the research community. However, the informal writing that often characterizes online user-generated texts poses a challenge for automatic processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing step: in short, the task of converting non-standard word forms into a canonical one. The present work contributes to this field by developing a rule-based normalization system for Greek tweets. We analyze the categories of out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (the Levenshtein distance) to tackle noise in the cases under scope. To evaluate the system we perform both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The intrinsic evaluation suggests that our system has an accuracy of approximately 95%, compared to approximately 81% for the baseline. In the extrinsic evaluation, a boost of approximately 8% in tagging performance is observed when the text has been preprocessed through lexical normalization.
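The edit-distance component the rules are combined with is the classic Levenshtein distance, which can be implemented in a few lines. The vocabulary below is a toy stand-in, not the system's actual Greek lexicon.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance with unit costs for
    # insertion, deletion and substitution.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Map an out-of-vocabulary token to its closest in-vocabulary word -- the
# edit-distance fallback used alongside the hand-crafted rules.
VOCAB = ["hello", "help", "world"]

def closest(oov: str) -> str:
    return min(VOCAB, key=lambda w: levenshtein(oov, w))

print(closest("helllo"))  # -> hello
```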
55

Learning from noisy labels by importance reweighting: a deep learning approach

Fang, Tongtong January 2019
Noisy labels can severely degrade classification performance. Deep neural networks in particular can memorize noisy labels, leading to poor generalization. Recently, label-noise-robust deep learning has outperformed traditional shallow learning approaches in handling complex input data without prior knowledge of how the label noise is generated. Learning from noisy labels by importance reweighting is well studied, but existing deep learning work in this line has failed to provide a reasonable importance-reweighting criterion and has therefore achieved unsatisfactory experimental performance. Targeting this gap and inspired by domain adaptation, we propose a novel label-noise-robust deep learning approach based on importance reweighting. Noisy-labeled training examples are weighted by minimizing the maximum mean discrepancy between the loss distributions of noisy-labeled and clean-labeled data. In experiments, the proposed approach outperforms the baselines. The results show considerable research potential in applying domain adaptation to the label noise problem by bridging the two areas. Moreover, the proposed approach may motivate other interesting problems in domain adaptation by enabling importance reweighting to be used in deep learning.
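The reweighting criterion above rests on the maximum mean discrepancy (MMD) between loss distributions. A minimal numpy sketch of a Gaussian-kernel MMD estimate follows; the loss samples are synthetic and the bandwidth is illustrative, not the thesis's actual setup.

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    # Biased estimate of the squared maximum mean discrepancy between two
    # 1-D samples under a Gaussian kernel:
    #   MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

# Synthetic per-sample loss values: clean-labeled data tends to low loss,
# noisy-labeled data to higher, more spread-out loss (numbers are invented).
rng = np.random.default_rng(0)
clean_losses = rng.normal(0.5, 0.1, 500)
noisy_losses = rng.normal(1.5, 0.4, 500)

# Differing distributions give a large MMD; identical samples give zero.
print(mmd2(clean_losses, noisy_losses) > mmd2(clean_losses, clean_losses))  # True
```

Minimizing such a discrepancy over example weights is the idea the approach builds on; the full method embeds this inside deep network training.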
56

Machine Learning for Improving Detection of Cooling Complications: A case study

Bruksås Nybjörk, William January 2022
The growing market for cold-chain pharmaceuticals requires reliable and flexible logistics solutions that ensure the quality of the drugs. These pharmaceuticals must be kept cool to retain their function and effect, so keeping them within the specified temperature interval is of the utmost importance. Temperature-controlled containers are a common logistics solution for cold-chain pharmaceutical freight. One of the leading manufacturers of these containers provides lease and shipment services while also regularly assessing the cooling function. A method is applied for detecting cooling issues and preventing impaired containers from being sent to customers. However, the method tends to misclassify containers: it misses some faulty containers while also flagging functional containers as faulty. This thesis aims to identify the variables associated with cooling performance and then apply machine learning to evaluate whether recall and precision can be improved. An improvement could lead to faster response, less waste, and even more reliable freight, which could be vital for both companies and patients. The labeled dataset has a binary outcome (no cooling issues, cooling issues) and is heavily imbalanced, since the containers are of high quality and undergo frequent testing and maintenance; only a small fraction have cooling issues.
After analyzing the data, extensive deviations were identified which suggested that part of the labeled data was misclassified. The suspected misclassification was corrected and compared to the original data. A Random Forest classifier combined with random oversampling and threshold tuning performed best on the corrected class labels: recall reached 86% and precision 87%, a very promising result. A Random Forest classifier combined with random oversampling scored best on the original class labels: recall reached 77% and precision 44%, much lower than with the adjusted labels but still a valid result given the suspected extent of misclassification. Power output variables, compressor error variables, and the standard deviation of the inside temperature showed a clear connection to cooling complications. Clear links were also found for the critical cases where the set temperature could not be maintained; these cases could be detected easily but were harder to prevent, since they often appeared without warning.
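The winning pipeline (random oversampling of the minority class, a Random Forest, and a tuned decision threshold) can be sketched with scikit-learn on synthetic data. The dataset, class ratio, and threshold value below are illustrative stand-ins, not the thesis's container data or tuned threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem: ~95% "no cooling issues", ~5% "issues".
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random oversampling: duplicate minority rows until the classes balance.
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# Threshold tuning: lowering the cut-off below 0.5 trades precision for recall.
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.3).astype(int)
print(recall_score(y_te, pred), precision_score(y_te, pred))
```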
57

RDS - RECOVERING DISCARDED SAMPLES WITH NOISY LABELS: TECHNIQUES FOR TRAINING DEEP LEARNING MODELS WITH NOISY SAMPLES

VITOR BENTO DE SOUSA 20 May 2024
Deep learning models for image classification have achieved state-of-the-art performance across a wide range of applications. However, noisy samples, i.e., instances with incorrect labels, are a prevalent challenge in datasets derived from real-world applications, and training deep learning models on such datasets inevitably compromises their performance.
State-of-the-art models such as Co-teaching+ and Jocor use the Small Loss Approach (SLA) to handle noisy samples in the multi-class setting. In this work, a new technique named Recovering Discarded Samples (RDS) was developed to address noisy samples, working in conjunction with the SLA. To demonstrate its effectiveness, RDS was applied to the Co-teaching+ and Jocor models, yielding two new models, RDS-C and RDS-J; the results indicate gains of up to 6 percent in test metrics for both. A third model, RDS-Contrastive, was also developed and surpassed the state of the art by up to 4 percent in test accuracy. Furthermore, this work extended the SLA to the multilabel setting, producing the SLA Multilabel (SLAM) technique, with which two additional models for the multilabel scenario with noisy samples were developed. The multiclass models proposed here were applied to a real-world environmental problem, while the multilabel models were used to solve a real problem in the oil and gas industry.
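The Small Loss Approach that RDS builds on can be sketched in a few lines: per mini-batch, keep the fraction of samples with the smallest loss as presumably clean and update only on those. The RDS step of recovering discarded samples is not shown, and the loss values and forget rate below are invented for illustration.

```python
import numpy as np

def small_loss_select(losses, forget_rate):
    # Small Loss Approach: keep the (1 - forget_rate) fraction of the batch
    # with the smallest per-sample loss, treating those samples as clean.
    n_keep = int(len(losses) * (1.0 - forget_rate))
    return np.argsort(losses)[:n_keep]

# Per-sample losses for one mini-batch (values invented for illustration).
batch_losses = np.array([0.10, 2.30, 0.20, 1.90, 0.15, 0.05])
kept = small_loss_select(batch_losses, forget_rate=0.5)
print(sorted(kept.tolist()))  # -> [0, 4, 5]
```

The intuition is that networks fit correctly labeled examples first, so high-loss samples early in training are more likely to be mislabeled; RDS then revisits the discarded high-loss samples rather than dropping them outright.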
58

ON THE CONVERGENCE AND APPLICATIONS OF MEAN SHIFT TYPE ALGORITHMS

Aliyari Ghassabeh, Youness 01 October 2013
Mean shift (MS) and subspace constrained mean shift (SCMS) algorithms are non-parametric, iterative methods for finding a representation of a high-dimensional data set on a principal curve or surface embedded in a high-dimensional space. The representation of high-dimensional data on a principal curve or surface, the class of mean shift type algorithms and their properties, and applications of these algorithms are the main focus of this dissertation. Although MS and SCMS algorithms have been used in many applications, a rigorous study of their convergence is still missing. This dissertation aims to fill some of the gaps between theory and practice by investigating convergence properties of these algorithms. In particular, we propose a sufficient condition for a kernel density estimate with a Gaussian kernel to have isolated stationary points, which guarantees the convergence of the MS algorithm. We also show that the SCMS algorithm inherits some of the important convergence properties of the MS algorithm: the monotonicity and convergence of the density estimate values along the sequence of output values are shown, and the distance between consecutive points of the output sequence converges to zero, as does the projection of the gradient vector onto the subspace spanned by the D-d eigenvectors corresponding to the D-d largest eigenvalues of the local inverse covariance matrix. Furthermore, three new variations of the SCMS algorithm are proposed, and the running times and performance of the resulting algorithms are compared with the original SCMS algorithm. We also propose an adaptive version of the SCMS algorithm that accounts for new incoming samples without rerunning the algorithm on the whole data set. Finally, we develop new potential applications of the MS and SCMS algorithms.
These applications involve finding straight lines in digital images; pre-processing data before applying locally linear embedding (LLE) and ISOMAP for dimensionality reduction; noisy source vector quantization, where the clean data need to be estimated before the quantization step; improving the performance of kernel regression in certain situations; and skeletonization of digitally stored handwritten characters. / Thesis (Ph.D., Mathematics & Statistics) -- Queen's University, 2013-09-30
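The basic MS iteration with a Gaussian kernel can be sketched in numpy: repeatedly move a point to the kernel-weighted mean of the data until it settles at a density mode. The two-cluster data, bandwidth, and iteration count are illustrative choices, not the dissertation's experimental setup.

```python
import numpy as np

def mean_shift(x, data, bandwidth=0.5, iters=50):
    """Follow one mode-seeking trajectory of the mean shift algorithm."""
    for _ in range(iters):
        # Gaussian-kernel weights of every data point relative to x ...
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        # ... and the shift: move x to the weighted mean of the data.
        x = (w[:, None] * data).sum(axis=0) / w.sum()
    return x

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.1, (100, 2)),   # cluster near the origin
                  rng.normal(3.0, 0.1, (100, 2))])  # cluster near (3, 3)
mode = mean_shift(np.array([0.3, -0.2]), data)
print(mode)  # converges to the density mode of the origin cluster
```

SCMS modifies this update by projecting the shift onto a locally estimated subspace, which is what constrains the iterates to a principal curve or surface.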
59

Alignment of parallel sentences in noisy corpora

Lamraoui, Fethi 07 1900
Current statistical machine translation systems require parallel corpora in large quantities and typically obtain such corpora through automatic alignment, at the sentence level, of a text and its translation. The alignment of parallel corpora received a lot of attention in the eighties and is largely considered a solved problem in the community. We show that this is not the case and propose an alignment technique that we compare to state-of-the-art aligners. Our technique is simple, fast, and can handle large amounts of data, and it often produces better results than the state of the art. We analyze the robustness of our alignment technique across different text genres and noise levels.
For this, our experiments are divided into two main parts. In the first part, we measure alignment quality on the BAF corpus with up to 60% noise. In the second part, we use the Europarl corpus and revisit the alignment procedure with which it was prepared; we show that better SMT performance can be obtained using our alignment technique.
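Although the abstract does not spell out the aligner's algorithm, sentence alignment is classically cast as dynamic programming over sentence lengths, in the spirit of Gale and Church. The sketch below scores 1-1 links by length difference and allows 1-0/0-1 skips at a fixed cost; the skip cost and lengths are invented for illustration and this is not the memoir's actual aligner.

```python
def align(src_lens, tgt_lens, skip_cost=10):
    # Minimal length-based sentence alignment: dynamic programming over
    # 1-1 (diagonal), 1-0 (up) and 0-1 (left) moves, scoring a 1-1 link
    # by the absolute difference of the two sentence lengths.
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # link src[i-1] with tgt[j-1]
                c = cost[i - 1][j - 1] + abs(src_lens[i - 1] - tgt_lens[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i:        # skip a source sentence
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j:        # skip a target sentence
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)
    # Backtrack and collect only the 1-1 links.
    links, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            links.append((i - 1, j - 1))
        i, j = pi, pj
    return links[::-1]

# The target side has an extra long sentence (length 50) with no source match.
print(align([10, 25, 12], [11, 24, 50, 13]))  # -> [(0, 0), (1, 1), (2, 3)]
```

Noise in a corpus shows up exactly as such unmatched or length-mismatched segments, which is why skip handling matters for robustness.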
60

A novel theoretical and experimental approach permits a systems view on stochastic intracellular Ca2+ signalling

Thurley, Kevin 30 August 2011
Ca(2+) is a universal second messenger in eukaryotic cells, transmitting information through sequences of concentration spikes. A prominent mechanism for generating these spikes involves Ca(2+) release from the endoplasmic reticulum Ca(2+) store via IP3-sensitive channels. Puffs are the elemental events of IP3-induced Ca(2+) release through single clusters of channels. Intracellular Ca(2+) dynamics is a stochastic system, but a complete stochastic theory has not yet been developed. As a new concept, this thesis formulates the theory in terms of interpuff-interval and puff-duration distributions, since, unlike the properties of individual channels, these can be measured in vivo. This leads to a non-Markovian description of the system dynamics, for which analytical solutions and efficient stochastic simulation techniques are derived. The theory reproduces the typical spectrum of Ca(2+) signals. Signal form and the average interspike interval (ISI) depend sensitively on the detailed properties and spatial arrangement of clusters. In contrast, the relation between the average and the standard deviation of ISIs does not depend on cluster properties or cluster arrangement, is robust with respect to cell variability, and can only be regulated by global feedback processes in the Ca(2+) signalling pathway. That relation is essential for pathway function, since it ensures frequency encoding despite the randomness of ISIs and determines the maximal information content of the spike trains. Apart from the theoretical investigation, this thesis verifies key results by live-cell imaging of Ca(2+) spikes and puffs in HEK cells. The work thus comprises a systems-level investigation of Ca(2+) signals, integrating data and theory from different levels of molecular organisation, and demonstrates that stochastic Ca(2+) signals can transmit information reliably and that the mechanism can be adapted to the specific needs of a pathway by global feedback.
