About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Third-Party TCP Rate Control

Bansal, Dushyant January 2005 (has links)
The Transmission Control Protocol (TCP) is the dominant transport protocol in today's Internet. The original design of TCP left congestion control open to future designers. Short of implementing changes to the TCP stack on the end-nodes themselves, Internet Service Providers have employed several techniques to operate their network equipment efficiently. These techniques amount to shaping traffic to reduce cost and improve overall customer satisfaction.

The method that gives maximum control when performing traffic shaping is an inline traffic shaper, which sits in the middle of a flow, allowing packets to pass through it and, with policy-limited freedom, inspecting and modifying packets as it pleases. However, practical issues such as hardware reliability or ISP policy may prevent such a solution from being employed. For example, an ISP that does not fully trust the quality of the traffic shaper would not want such a product placed in-line with its equipment, as it poses a significant threat to its business. What is required in such cases is third-party rate control.

Formally defined, a third-party rate controller is one that can see all traffic and inject new traffic into the network, but cannot remove or modify existing network packets. Given these restrictions, we present and study a technique to control TCP flows, namely triple-ACK duplication. The triple-ACK algorithm affords significant capabilities to a third-party traffic shaper. We provide an analytical justification for why this technique works under ideal conditions and demonstrate via simulation the bandwidth reduction achieved. When judiciously applied, the triple-ACK duplication technique produces minimal badput while producing significant reductions in bandwidth consumption under ideal conditions. Based on a brief study, we show that our algorithm is able to selectively throttle one flow while allowing another to gain in bandwidth.
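The mechanism is concrete enough to sketch in a few lines. The following is a minimal illustration, not the thesis's implementation: it assumes scapy, root privileges, and hypothetical endpoint addresses. Injecting three copies of an observed acknowledgement makes the sender count three duplicate ACKs, trigger fast retransmit, and halve its congestion window, all without removing or modifying any existing packet:

```python
# Sketch of triple-ACK duplication by a third-party rate controller.
# Hypothetical hosts; requires root privileges and scapy.
from scapy.all import IP, TCP, send, sniff

RECEIVER = "10.0.0.2"   # side of the flow that emits the ACKs we copy
SENDER = "10.0.0.1"     # side whose sending rate we want to throttle
SENTINEL_ID = 0xBEEF    # mark injected packets so we never re-copy them

def duplicate_ack(pkt):
    ip, tcp = pkt[IP], pkt[TCP]
    if ip.id == SENTINEL_ID:
        return  # one of our own injected duplicates
    # Act only on pure ACKs (ACK flag set, no payload) of the target flow.
    if "A" in tcp.flags and len(tcp.payload) == 0:
        dup = IP(src=ip.src, dst=ip.dst, id=SENTINEL_ID) / TCP(
            sport=tcp.sport, dport=tcp.dport, flags="A",
            seq=tcp.seq, ack=tcp.ack, window=tcp.window)
        # Three identical copies register at the sender as three duplicate
        # ACKs, invoking fast retransmit and multiplicative decrease.
        send(dup, count=3, verbose=False)

sniff(filter=f"tcp and src host {RECEIVER} and dst host {SENDER}",
      prn=duplicate_ack, store=False)
```

As the abstract notes, the duplication has to be applied judiciously: copying every ACK indiscriminately provokes needless retransmissions (badput) alongside the intended window reduction.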
53

Indirect Influence of English on Kiswahili: The Case of Multiword Duplicates between Kiswahili and English

Ochieng, Dunlop 22 October 2015 (has links) (PDF)
Some proverbs, idioms, nominal compounds, and slogans duplicate in form and meaning between several languages. An example between German and English is Liebe auf den ersten Blick and “love at first sight” (Flippo, 2009), while an example between Kiswahili and English is uchaguzi ulio huru na haki and “free and fair election.” Duplication of such strings of words between languages as different in descent and typology as Kiswahili and English is irregular. On this ground, Kiswahili academies and a number of experts of Kiswahili assumed, prior to the present study, that the Kiswahili versions of the expressions derive from their congruent English counterparts. The assumption nonetheless lacked empirical evidence and discounted other potential causes of the phenomenon, namely analogical extension, nativism and cognitive metaphoricalization (Makkai, 1972; Land, 1974; Lakoff & Johnson, 1980b; Ruhlen, 1987; Lakoff, 1987; Gleitman and Newport, 1995). Against this background, we took on the task of empirically investigating what causes this formal and semantic duplication of strings of words (multiword expressions) between English and Kiswahili to a degree beyond chance expectation. In this endeavour, we administered a checklist to 24 respondents, interviews to 43, an online questionnaire to 102, a translation test to 47, and a translationality test to 8. Online questionnaire respondents were drawn from 21 regions of Tanzania, whereas respondents to the remaining instruments were from Zanzibar, Dar es Salaam, Pwani, Lindi, Dodoma and Kigoma. Complementarily, we analysed the Chemnitz Corpus of Swahili (CCS), the Helsinki Swahili Corpus (HSC), and the Corpus of Contemporary American English (COCA) for clues on the sources and trends of expressions exhibiting this characteristic between Kiswahili and English. Furthermore, we reviewed the Bible, dictionaries, encyclopaedias, books, articles, expression lists, wikis, and phrase books in pursuit of the etymologies and histories of the concepts underlying the focus expressions. Our analysis shows that most of the Kiswahili versions of the focus expressions are the product of loan translation and loan rendition from English. We found that economic, political and technological changes, mostly induced by the liberalization policy of the 1990s in Tanzania, created lexical gaps in Kiswahili that needed to be filled. We discovered that Kiswahili, among other means, fills such gaps through loan translation and loan rendition of English phrases. Prototypical examples of notions whose English labels Kiswahili has translated word for word are “human rights”, “free and fair election”, “the World Cup” and “multiparty democracy”. We conclude that Kiswahili finds it easier and more economical to translate existing English labels for imported notions than to innovate original labels for the concepts. Even so, our analysis revealed that a few of the duplicate Kiswahili multiword expressions might be a product of nativism, cognitive metaphoricalization and analogy. We observed, for instance, that the formation of figurative meanings follows a more or less similar pattern across human languages, with secondary meanings deriving from source domains. As long as the source domains are common to many human environments, we found it plausible for certain multiword expressions to duplicate spontaneously between several human languages.
Academically, our study has demonstrated how multiword expressions that duplicate between several languages can be studied using primary data, corpora, documentary review and observation. In particular, the study has designed a framework for studying the sources of such expressions, and terminology for describing the phenomenon. What's more, the study has collected a number of expressions that duplicate between Kiswahili and English, which other researchers can use in similar studies.
54

Free-text Informed Duplicate Detection of COVID-19 Vaccine Adverse Event Reports

Turesson, Erik January 2022 (has links)
To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates degrade the quality of the data and its analysis, so efforts should be made to detect and remove them automatically. Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for the existing solutions employed in VigiBase. This thesis project explores methods for this task, focusing specifically on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean-pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings. We apply a Gradient Boosted Decision Tree (GBDT) model to classify report pairs as duplicates or non-duplicates. For a better-calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model implements a ruleset to find reports whose narratives mention a unique identifier of their duplicate. The best performing model achieves 73.3% recall and zero false positives on a controlled test dataset, for an F1-score of 84.6%, vastly outperforming the F1-score of 60.1% of the model previously implemented in VigiBase. Further, when manually annotated by three reviewers, it reached an average precision of 87% when fully deduplicating 11,756 reports among records relating to hearing disorders.
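The narrative-similarity feature is simple to reproduce in outline. The sketch below assumes the publicly released SapBERT checkpoint on Hugging Face rather than the fine-tuned model described above, and the two narratives are invented; it mean-pools token embeddings under the attention mask and compares the resulting sentence embeddings by cosine similarity:

```python
# Minimal sketch: mean-pooled SapBERT sentence embeddings compared by
# cosine similarity. The thesis fine-tunes the model on duplicate pairs;
# here the public base checkpoint stands in.
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)

def embed(narratives):
    batch = tok(narratives, padding=True, truncation=True,
                return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

a, b = embed(["Tinnitus reported after second dose.",
              "Ringing in the ears following dose two."])
similarity = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(float(similarity))  # one feature among several fed to the GBDT
```

A score like this would enter the GBDT alongside structured-field comparisons rather than deciding duplication on its own.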
56

A Study of Issues in the Corporate Income Tax (法人所得稅問題之研究)

Wang, Jian (王建□) Unknown Date (has links)
Tax administrations everywhere now treat the implementation of a corporate income tax as a priority, in the hope of building a tax system with direct taxes at its core. The corporate income tax has a history of only sixty to seventy years, yet it already occupies an important place in national revenues, so an examination of its main problems seems warranted.

Many countries levy a corporate income tax alongside the personal income tax. Corporate income that has borne the corporate tax is taxed again under the personal income tax when it is distributed to the corporation's members, so the same income is taxed twice. Whether this constitutes double taxation is an open question; if it does, how should it be remedied? Many remedies exist, and the choice among them should rest on economic circumstances and needs, as well as on the efficiency of the tax administration.

The corporate income tax is generally held to be a direct tax, and a direct tax is supposedly one whose burden cannot be shifted to others. Whether the corporate income tax really cannot be shifted is doubtful: traditional theory holds that it is never shifted, while newer theory and business statistics suggest that over a sufficient period it is shifted, although the degree of shifting is hard to measure.

Since the corporate income tax is a form of income tax, whether it can adopt highly progressive rates like the personal income tax deserves study. From the standpoint of equalizing social wealth, progressive corporate rates are indeed quite effective: according to Richard Goode's statistics, corporate income is an important cause of wealth concentration, especially in economically developed countries. From the standpoint of encouraging saving and investment, however, progression discourages saving and weakens the incentive to invest, to the detriment of economic development, so progressive corporate rates are ill-advised. The evolution of corporate income taxes across countries likewise shows rates tending toward proportionality, a measure that puts economic policy ahead of social policy.

After examining these three theoretical questions, this thesis takes up two technical ones: the depreciation of fixed assets and the valuation of inventory. The former is the central problem in valuing fixed assets; the latter is the key item in valuing current assets. The size of depreciation charges and of inventory bears directly on the corporate income tax. The thesis therefore discusses the various methods available for computing depreciation and inventory in order to clarify their effect on the tax.
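To make the final point concrete, here is a small worked example showing how the depreciation method alone changes first-year taxable income. The figures and methods (straight-line versus double-declining balance) are illustrative assumptions, not drawn from the thesis:

```python
# Hypothetical asset: cost 100,000, salvage 10,000, 5-year life,
# pre-depreciation profit of 50,000 per year.

def straight_line(cost, salvage, life):
    """Equal charge each year."""
    return [(cost - salvage) / life] * life

def double_declining(cost, salvage, life):
    """Accelerated: a fixed rate applied to the declining book value."""
    book, rate, charges = cost, 2 / life, []
    for _ in range(life):
        dep = min(book * rate, book - salvage)  # never depreciate below salvage
        charges.append(dep)
        book -= dep
    return charges

cost, salvage, life, profit = 100_000, 10_000, 5, 50_000
for name, sched in [("straight-line", straight_line(cost, salvage, life)),
                    ("double-declining", double_declining(cost, salvage, life))]:
    print(f"{name:16s} year-1 charge {sched[0]:>8,.0f}"
          f"  year-1 taxable income {profit - sched[0]:>8,.0f}")
```

Straight-line leaves 32,000 of first-year taxable income against 10,000 under the accelerated method; total charges over the asset's life are the same 90,000 under both, so acceleration defers tax rather than eliminating it.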
57

Comparative Genomics of Gossypium spp. through GBS and Candidate Genes – Delving into the Controlling Factors behind Photoperiodic Flowering

Young, Carla Jo Logan 16 December 2013 (has links)
Cotton has been a worldwide economic staple in textiles and oil production. There has been a concerted effort to improve cotton yield and quality to compete with man-made fibers. Unfortunately, cultivated cotton has limited genetic diversity; therefore, finding new marketable traits within cultivated cotton has reached a plateau. To alleviate this problem, traditional breeding programs have been attempting to incorporate practical traits from wild relatives into cultivated lines. This incorporation has presented a new problem: uncultivated cotton is hampered by photoperiodism. Because of their differing flowering times, wild and cultivated cotton species have traditionally been impossible to breed together in many commercial production areas worldwide, which has inhibited the incorporation of new traits. Before favorable traits from undomesticated cotton can be integrated into elite cultivated lines using marker-assisted selection breeding, markers associated with photoperiod independence need to be discovered. To increase what is known about this debilitating trait, we set out to identify informative markers associated with photoperiodism. This study was segmented into four areas. First, we reviewed the history of cotton to highlight current problems in production. Next, we explored cotton's floral development through a study of floral-transition candidate genes. The third area was an in-depth analysis of Phytochrome C, previously linked to photoperiod independence in other crops. In the final area of study, Genotype-By-Sequencing (GBS) was used in a segregating population to identify single nucleotide polymorphisms (SNPs) associated with photoperiod independence. In short, this research reported SNP differences in thirty-eight candidate gene homologs within the flowering-time network, including photoreceptors, light-dependent transcripts, circadian clock regulators, and floral integrators. Our research also linked other discrete SNP differences, beyond those contained within candidate genes, to photoperiodicity in cotton. In conclusion, the SNP markers found in this study may be used in future marker-assisted selection (MAS) breeding schemes to incorporate desirable traits into elite lines without the introgression of photoperiod sensitivity.
58

Duplicate detection of multimodal and domain-specific trouble reports when having few samples: An evaluation of models using natural language processing, machine learning, and Siamese networks pre-trained on automatically labeled data

Karlstrand, Viktor January 2022 (has links)
Trouble and bug reports are essential in software maintenance and for identifying faults, a challenging and time-consuming task. When the fault and reports are similar or identical to previous, already-resolved ones, the effort can be reduced significantly, making the prospect of automatically detecting duplicates very compelling. In this work, common methods and techniques in the literature are evaluated and compared on domain-specific, multimodal trouble reports from Ericsson software. The number of available samples is small, a case not well studied in the area. On this basis, both traditional techniques and more recent ones based on deep learning are considered, with the goal of accurately detecting duplicates. First, the more traditional approach based on natural language processing and machine learning is evaluated using different vectorization techniques and similarity measures adapted and customized to the domain-specific trouble reports. The multimodality and many fields of the trouble reports call for a wide range of techniques, including term frequency-inverse document frequency, BM25, and latent semantic analysis. A pipeline is proposed that processes each data field of the trouble reports independently and automatically weighs the importance of each field. The best-performing model achieves a recall rate of 89% for a duplicate-candidate list of size 10. Further, Shapley values are used to explore which types of data matter most for duplicate detection. Results indicate that utilizing all types of data indeed improves performance, and that date and code parameters are strong indicators. Second, a Siamese network based on Transformer encoders is evaluated on data fields believed to carry some underlying representation of semantic meaning or sequentially important information, which a deep model can capture. To alleviate the issues of having few samples, pre-training through automatic data labeling is studied. Results show an increase in performance compared to not pre-training the Siamese network. Compared to the more traditional model, however, it performs on par, indicating that traditional models may perform equally well when samples are few, besides being simpler, more robust, and faster.
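The per-field pipeline is easy to illustrate. The sketch below is a minimal stand-in: field names, weights, and example reports are invented, and the thesis weighs fields automatically rather than hard-coding them as done here:

```python
# Minimal sketch of a per-field TF-IDF pipeline for trouble-report pairs;
# hypothetical field names and weights, not the thesis's configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FIELDS = ["heading", "description", "code_parameters"]   # assumed fields
WEIGHTS = {"heading": 0.3, "description": 0.5, "code_parameters": 0.2}

def pair_score(report_a, report_b, corpus):
    """Weighted sum of per-field TF-IDF cosine similarities."""
    score = 0.0
    for field in FIELDS:
        # One vectorizer per field, fitted on that field across the corpus.
        vec = TfidfVectorizer().fit(doc[field] for doc in corpus)
        m = vec.transform([report_a[field], report_b[field]])
        score += WEIGHTS[field] * cosine_similarity(m[0], m[1])[0, 0]
    return score  # threshold or rank to produce the candidate list

corpus = [
    {"heading": "crash on restart", "description": "node reboots in loop",
     "code_parameters": "ERR_42 baseband"},
    {"heading": "restart crash", "description": "reboot loop after upgrade",
     "code_parameters": "ERR_42 baseband"},
]
print(pair_score(corpus[0], corpus[1], corpus))
```

In a real deployment the vectorizers would be fitted once up front, and the weights learned from labeled duplicate pairs rather than fixed.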
59

Finding duplicate offers in the online marketplace catalogue using transformer based methods: An exploration of transformer based methods for the task of entity resolution

Damian, Robert-Andrei January 2022 (has links)
The amount of data available on the web is constantly growing, and e-commerce websites are no exception. Given the abundance of available information, finding offers for the same product in the catalogues of different retailers is a challenge. The problem addresses the needs of multiple actors: a customer wants to find the best deal for the product they intend to buy, while a retailer wants to keep up with the competition and adapt its pricing strategy accordingly. Various services already offer the possibility of finding duplicate products in the catalogues of e-commerce retailers, but their solutions are based on matching a Global Trade Identification Number (GTIN). This strategy is limited because a GTIN may not be made publicly available by a competitor, may differ for the same product exported by the manufacturer to different markets, or may not even exist for low-value products. The field of Entity Resolution (ER), a sub-branch of Natural Language Processing (NLP), focuses on matching duplicate database entries when a deterministic identifier is not available. We investigate various solutions from the field and present a new model called Spring R-SupCon that focuses on low-volume datasets. Our work builds upon the recently introduced R-SupCon model, introducing a new learning scheme that improves R-SupCon's F1 score by up to 74.47%, and surpasses Ditto by up to 12% F1 score on low-volume datasets. Moreover, our experiments show that smaller language models can be used for ER with minimal loss in performance. This has the potential to extend the adoption of Transformer-based solutions to companies and markets where datasets are difficult to create, as is the case for the Swedish marketplace Fyndiq.
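The matching idea can be sketched compactly. The snippet below is a simplified stand-in, not the Spring R-SupCon model: it borrows Ditto's "COL ... VAL ..." attribute serialization and scores a pair of offers by embedding cosine similarity with an off-the-shelf sentence-transformers model. The model name, example offers, and 0.8 threshold are all assumptions:

```python
# Simplified offer matching: Ditto-style serialization plus embedding
# cosine similarity; a stand-in for the thesis's contrastive model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def serialize(offer: dict) -> str:
    """Flatten attributes into the 'COL <attr> VAL <value>' form used by Ditto."""
    return " ".join(f"COL {k} VAL {v}" for k, v in offer.items())

offer_a = {"title": "Logitech MX Master 3S Wireless Mouse", "brand": "Logitech"}
offer_b = {"title": "MX Master 3S mouse, graphite, wireless", "brand": "Logitech"}

emb = model.encode([serialize(offer_a), serialize(offer_b)],
                   convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print("duplicate" if score > 0.8 else "distinct", round(score, 3))
```

A supervised contrastive scheme like R-SupCon instead fine-tunes the encoder so that duplicate offers cluster tightly before any threshold is applied, which is what makes it effective on low-volume datasets.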
