Global ETD Search

41	Operadores físicos binários para consultas por similaridade em SGBDR / Physical binary operators for similarity queries in RDBMS Luiz Olmes Carvalho 26 March 2018 (has links) O operador de Junção é um operador importante da Álgebra Relacional que combina os pares de tuplas que atendem a uma dada condição de comparação entre os valores dos atributos de duas relações. Quando a comparação avalia a similaridade entre pares de valores, o operador é chamado Junção por Similaridade. Esse operador tem aplicações em diversos contextos, tais como o suporte de tarefas de mineração e análise de dados em geral, e a detecção de quase-duplicatas, limpeza de dados e casamento de cadeias de caracteres em especial. Dentre os operadores de junção por similaridade existentes, a Junção por Abrangência (range join) é a mais explorada na literatura. Contudo, ela apresenta limitações, tal como a dificuldade para se encontrar um limiar de similaridade adequado. Nesse contexto, a Junção por k-vizinhos mais próximos (k-nearest neighbor join, kNN join) é considerada mais intuitiva, e portanto mais útil que o range join. Entretanto, executar um kNN join é computacionalmente mais caro, o que demanda por abordagens baseadas na técnica de laço aninhado, e as técnicas existentes para a otimização do algoritmo são restritas a um domínio de dados em particular. Visando agilizar e generalizar a execução do kNN join, a primeira contribuição desta tese foi o desenvolvimento do algoritmo QuickNearest, baseado na técnica de divisão e conquista, que é independente do domínio dos dados, independente da função de distância utilizada, e que computa kNN joins de maneira muito eficiente. Os experimentos realizados apontam que o QuickNearest chega a ser 4 ordens de magnitude mais rápido que os métodos atuais. 
Além disso, o uso de operadores de junção por similaridade em ambientes relacionais é problemático, principalmente por dois motivos: (i) em geral o resultado tem cardinalidade muito maior do que o realmente necessário ou esperado pela maioria das aplicações de análise de dados; e (ii) as consultas que os utilizam envolvem também operações de ordenação, embora a ordem seja um conceito não associado à teoria relacional. A segunda contribuição da tese aborda esses dois problemas, tratando os operadores de junção por similaridade existentes como casos particulares de um conjunto mais amplo de operadores binários, para o qual foi definido o conceito de Wide-joins. Os operadores wide-joins recuperam os pares mais similares em geral e incorporam a ordenação como uma operação interna ao processamento, de forma compatível com a teoria relacional e que permite restringir a cardinalidade dos resultados a tuplas de maior interesse para as aplicações. Os experimentos realizados mostram que os wide-joins são rápidos o suficiente para serem usados em aplicações reais, retornam resultados de qualidade melhor do que os métodos concorrentes e são mais adequados para execução num ambiente relacional do que os operadores de junção por similaridade tradicionais. / Joins are important Relational Algebra operators. They pair tuples from two relations that meet a given comparison condition between the attribute values. When the comparison evaluates the similarity between the values, the operator is called a Similarity Join. This operator has applications in a variety of contexts, such as supporting data mining tasks and data analysis in general, and near-duplicate detection, data cleaning and string matching in particular. Among the existing types of similarity joins, the range join is the most explored one in the literature. However, it has several shortcomings, such as the difficulty of finding adequate similarity thresholds. 
In this context, the k-nearest neighbors join (kNN join) is considered more intuitive, and therefore more useful, than the range join. However, executing a kNN join is computationally much more expensive, which demands implementations based either on nested-loop techniques, which are generic, or on optimization techniques that are specific to particular data domains. To accelerate and generalize kNN join execution, the first contribution of this thesis was the development of the QuickNearest algorithm, based on the divide-and-conquer approach, which is independent of the data domain, independent of the distance function used, and computes kNN joins very efficiently. Experiments performed with the QuickNearest algorithm show that it is up to four orders of magnitude faster than current methods. Nevertheless, using similarity join operators in relational environments remains generally troublesome, for two main reasons: (i) the result often has a cardinality much larger than what is actually needed or expected by most data analysis applications; and (ii) queries that use them almost always also require sorting operations, although the concept of order is not present in relational theory. The second contribution of the thesis addresses these two problems through the definition of the concept of Wide-joins, which treats the existing similarity join operators as particular cases of a more powerful set of binary operators. A wide-join operator retrieves the pairs that are most similar overall and incorporates ordering as an internal operation of its processing, which makes it fully compatible with relational theory. The concept also provides powerful ways to restrict the result cardinality to the tuples that are really meaningful for the applications. 
In fact, the experiments have also shown that wide-joins are fast enough to be useful for real applications, they return results of better quality than competing methods, and are more suitable for execution in a relational environment than the traditional similarity join operators. Junção por similaridade kNN Operadores relacionais QuickNearest Wide-join kNN QuickNearest Relational operators Similarity join Wide-join
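The generic nested-loop technique that the abstract above names as the domain-independent way to execute a kNN join can be sketched as follows. This is a minimal illustration only, not the QuickNearest algorithm; the function name knn_join_nested_loop, the use of Euclidean distance and the sample tuples are assumptions made for the example.

```python
import heapq
from math import dist  # Euclidean distance, available since Python 3.8


def knn_join_nested_loop(left, right, k, distance=dist):
    """Return {left_index: [(distance, right_index), ...]} holding the k
    nearest right-hand tuples for every left-hand tuple.

    Hypothetical helper: any distance function can be passed in, which is
    what makes the nested-loop approach generic (and slow: it evaluates
    len(left) * len(right) distances).
    """
    result = {}
    for i, r in enumerate(left):
        # Inner loop: compare r against every tuple of the right relation.
        scored = ((distance(r, s), j) for j, s in enumerate(right))
        result[i] = heapq.nsmallest(k, scored)
    return result


# Illustrative data only (not taken from the thesis).
left = [(0.0, 0.0), (5.0, 5.0)]
right = [(0.0, 1.0), (4.0, 5.0), (10.0, 10.0), (0.5, 0.0)]
pairs = knn_join_nested_loop(left, right, k=2)
```

Unlike a range join, the caller fixes k instead of a similarity threshold, so every left-hand tuple receives exactly k partners whenever the right relation has at least k tuples.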
42	Identifiering av UNO-kort : En jämförelse av bildigenkänningstekniker Al-Asadi, Yousif, Streit, Jennifer January 2023 (has links) Att spela sällskapsspelet UNO är en typ av umgängesform där målet är att trivas. En UNO-kortlek har 5 olika färger (blå, röd, grön, gul och joker) och olika symboler. Detta kan vara frustrerande för en person med nedsatt färgseende att delta, då en stor andel av spelet är beroende av att identifiera färgen på varje kort. Övergripande syftet med detta arbete är att utveckla en prototyp för objektigenkänning av UNO-kort som stöd för färgnedsatta. Arbetet sker genom jämförelse av objektigenkänningsmetoder som Convolutional Neural Network (CNN) och Template Matching-inspirerade metoder: hue template test samt binary template test. Detta kommer att jämföras i samband med igenkänning av färg och symbol tillsammans och separerat. Utvecklandet av prototypen kommer att utföras genom att träna två olika CNN-modeller, där en modell fokuserar endast på symboler och den andra fokuserar på både färg och symbol. Dessa modeller kommer att tränas med hjälp av YOLOv5 algoritmen som anses vara State Of The Art (SOTA) inom CNN med snabb exekvering. Samtidigt kommer template test att utvecklas med hjälp av OpenCV och genom att skapa mallar för korten. Dessa används för att göra en jämförelse av kortet som ska identifieras med hjälp av mallen. Utöver detta kommer K Nearest Neighbor (KNN), en maskininlärningsalgoritm att utvecklas med syfte att identifiera endast färg på korten. Slutligen utförs en jämförelse mellan dessa metoder genom mätning av prestanda som består av accuracy, precision, recall och latency. Jämförelsen kommer att ske mellan varje metod genom en confusion matrix för färger och symboler för respektive modell. Resultatet av studien visade på att modellen som kombinerar CNN och KNN presterade bäst vid valideringen av de olika metoderna. 
Utöver detta visar studien att template test är snabbare att implementera än CNN på grund av tiden för träningen som ett neuralt nätverk kräver. Dessutom visar latency att det finns en skillnad mellan de olika modellerna, där CNN presterade bäst. / Engaging in the social game of UNO represents a form of social interaction aimed at promoting enjoyment. Each UNO card deck consists of five different colors (blue, red, green, yellow and joker) and various symbols. However, participating in such a game can be frustrating for individuals with color vision impairment, since a substantial portion of the game relies on accurately identifying the color of each card. The overall purpose of this research is to develop a prototype for object recognition of UNO cards to support individuals with color vision impairment. This thesis involves comparing object recognition methods, namely Convolutional Neural Network (CNN) and Template Matching (TM). Each method will be compared with respect to color and symbol recognition, both separately and combined. The prototype will be developed by creating and training two different CNN models, where the first model focuses solely on symbol recognition while the other model incorporates both color and symbol recognition. These models will be trained through an algorithm called YOLOv5, which is considered state-of-the-art (SOTA) with fast execution. At the same time, two TM-inspired methods, hue template test and binary template test, will be developed with the help of OpenCV and by creating templates for the cards. Each template will be compared against the detected card in order to classify it. Additionally, the K-Nearest Neighbor (KNN) algorithm, a machine learning algorithm, will be developed specifically to identify the color of the cards. Finally, a comparative analysis of these methods will be conducted by evaluating performance metrics such as accuracy, precision, recall and latency. 
The comparison between the methods will be carried out using a confusion matrix for colors and symbols for each model. The study’s findings revealed that the model combining CNN and KNN demonstrated the best performance during the validation of the different models. Furthermore, the study shows that template tests are faster to implement than CNN due to the training time that a neural network requires. Moreover, the latency measurements show that there is a difference between the different models, where CNN achieved the highest performance. Image recognition CNN TM KNN UNO YOLOv5 Bildigenkänning CNN TM KNN UNO YOLOv5 Computer Sciences Datavetenskap (datalogi)
43	GIS-analys av potentiella habitat för mindre hackspett (Dendrocopos minor) : En analys i Karlstads kommun / GIS-analysis of potential habitat for the lesser spotted woodpecker (Dendrocopos minor) : An analysis in the municipality of Karlstad Palmgren, Annie January 2016 (has links) The purpose of the study is to develop a method that generates areas in the municipality of Karlstad that satisfy the habitat area requirement for the bird species lesser spotted woodpecker (Dendrocopos minor). A further purpose is to compare two different databases (kNN-Sweden and the Vegetation Map). The habitat area requirement for the lesser spotted woodpecker is 40 ha of forest dominated by deciduous trees, which may be fragmented over a maximum of 200 ha. The software ArcMap was used to develop a method to generate habitat areas, based on input from the kNN-Sweden and the Vegetation Map databases. The habitat areas were reviewed and compared by overlay analysis and checked against reported observations. The habitat areas generated from the kNN-Sweden database and those generated from the Vegetation Map database differed significantly. The format of the input data and the threshold values are probably contributing reasons for the difference. An important shortcoming of the kNN-Sweden database is that a buffer zone around water surfaces has been masked off during generalization and the volume of mature deciduous forest is generally underestimated. The number of observations of lesser spotted woodpecker within the habitat areas that fulfilled the requirement differed between kNN-Sweden and the Vegetation Map: the Vegetation Map had 138 observations of lesser spotted woodpecker while kNN-Sweden only had 38 observations. / Karlstads kommun behöver finna potentiella habitat för fågelarten mindre hackspett, som är en förslagen ansvarsart i kommunen. Mindre hackspett behöver minst 40 ha äldre lövdominerad skog inom ett område på upp till 200 ha för häckning. 
Behovet kan ses som artens habitatvillkor vid utsökning av potentiella områden för dess habitat. Syftet med studien är att utveckla en metod för att finna områden i Karlstads kommun som uppfyller habitatvillkoret för mindre hackspett. Syftet är även att jämföra två olika databaser, kNN-Sverige och Vegetationskartan, vid dess användning som indata. kNN-Sverige är en rikstäckande databas med information om Sveriges skogar och dess grundformat är digitala kartor i rasterformat med en upplösning på 25 meter. Informationen i kNN-Sverige bygger på en kombination av fältdata från Riksskogstaxeringens stickprovsinventering och heltäckande data från satellitbilder. Vegetationskartan består av polygonskikt innehållande klassning av olika vegetationstyper. Underlaget för vegetationsdata är flygbilder av närainfrarödkänslig färgfilm som har tolkats och karterats utifrån dominansförhållanden hos olika vegetationstyper, med stöd från aktivt fältarbete. Med hjälp av programvaran ArcMap 10.3 utvecklades en metod som genererade habitatområden, baserade på indata från kNN-Sverige respektive Vegetationskartan. Därefter granskades och jämfördes resultaten genom överlagringsanalys och kontroll mot inrapporterade observationer av mindre hackspett. Genererade habitatområden för kNN-Sverige respektive Vegetationskartan skiljde sig åt och det genererades betydligt fler områden med kNN-Sverige. Grundformatet på indata och valet av gränsvärden är troligen en bidragande faktor till skillnaderna. Resultaten från analysen av Vegetationskartan bedöms rimligare än kNN-Sveriges resultat. För kNN-Sverige saknades även en del områden där det finns mycket lövskog, till exempel vid Klarälvsdeltat. Vegetationskartans resultat påvisade däremot att det fanns områden med mycket lövskog kring Klarälvsdeltat. 
En stor brist hos kNN-Sverige är att en zon kring vattenytor har maskats bort vid generaliseringen och volymen av äldre lövskog generellt har underskattats, vilket bland annat kan förklara varför inte viktiga områden kring vatten kommit med. Antal observationer som låg inom habitatområden skilde sig betydligt mellan kNN-Sverige och Vegetationskartan, inom habitatområden som uppfyllde villkoret hade Vegetationskartan 138 observationer av mindre hackspett medan kNN-Sverige endast hade 38 observationer. Lesser spotted woodpecker habitat habitat requirement deciduous trees kNN-Sweden (kNN-Sverige) Vegetation Map (Vegetationskartan) GIS-analysis ArcMap Mindre hackspett habitat habitatvillkor lövskog kNN-Sverige Vegetationskartan GIS-analys ArcMap Computer and Information Sciences Data- och informationsvetenskap Earth and Related Environmental Sciences Geovetenskap och miljövetenskap
44	Algoritmo kNN na imputação de dados de espectros de massa do tipo MALDI-TOF: uma análise da influência da imputação com kNN sobre o desempenho de classificadores logísticos para identificação de bactérias Santos, Fábio dos 14 September 2018 (has links) Submitted by Angela Maria de Oliveira (amolivei@uepg.br) on 2018-11-06T17:08:39Z / Made available in DSpace on 2018-11-06T17:08:39Z (GMT). Previous issue date: 2018-09-14 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / O processo de identificação de bactérias relacionadas ao crescimento vegetal é alvo de diversos estudos na área de bioinformática. Uma das formas para realizar esta identificação é utilizar dados de espectrometria de massa do tipo MALDI-TOF para detectar a presença de proteínas ribossomais em uma amostra, e então, usar classificadores para processar estes dados e selecionar o rótulo com a maior probabilidade. Durante o processo de geração dos espectros de massa para classificação é comum a não detecção de algum dos picos relacionados a proteínas ribossomais. Considerando isto, este trabalho apresenta um estudo sobre o uso do algoritmo kNN para imputação desses casos. O estudo foi desenvolvido com o uso de classificadores logísticos para identificação de bactérias da espécie Staphylococcus aureus e do gênero Bacillus. Durante os experimentos foram testadas três técnicas para imputar dados: imputação com zero, imputação com a média do atributo faltante, e a imputação com kNN. Desta última foram usadas duas abordagens: função de agregação de média e função de agregação de mediana. 
O protocolo experimental implementado possibilitou avaliar a influência da imputação sobre os resultados de classificação sob diferentes cenários no que se refere ao número de variáveis faltantes. Os resultados obtidos mostram que o emprego do kNN não levou a uma redução do desempenho dos classificadores, em relação àquele observado quando do uso de dados completos. Além disto, a classificação de dados submetidos a imputação pelo kNN apresentou desempenho superior àquele verificado quando do uso dos demais métodos. / The identification of plant-growth-promoting bacteria is the subject of several studies in bioinformatics. One approach is to process a sample’s ribosomal protein data, obtained by MALDI-TOF mass spectrometry, through a classifier and select the label with the highest probability. However, when mass spectra are generated, it is common for some peaks related to ribosomal proteins to go undetected. With this in mind, this work presents a study of data imputation through the kNN algorithm. Logistic classifiers were applied to identify bacteria of the Bacillus genus and the Staphylococcus aureus species, while three data imputation techniques were tested: imputation with zero, with the average of the missing attribute, and with the kNN algorithm. For the latter technique, two approaches were considered: an average aggregation function and a median aggregation function. The adopted experimental protocol investigated the influence of imputation on classification results under different scenarios regarding the number of missing variables. The results show that neither kNN approach caused a significant reduction in the classifiers’ performance when compared with the complete-data approach, and that the classification of data imputed by kNN presented superior performance to that of the other methods considered. Imputação com kNN Espectrometria de Massa Regressão Logística Classificação de Bactérias Imputation with kNN Mass Spectrometry Logistic Regression Bacterial Classification
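The kNN imputation with mean or median aggregation described in the record above can be sketched as follows. This is a hedged illustration and not the thesis code: the function name knn_impute, the choice of a normalized Euclidean distance over the features observed in both rows, and the sample values are all assumptions.

```python
from math import sqrt
from statistics import mean, median


def knn_impute(rows, k, aggregate=mean):
    """Fill None entries using the k nearest rows that have the value.

    Hypothetical helper. Distance is Euclidean over the features observed
    in both rows, divided by their count so rows with more gaps are not
    favoured. Pass aggregate=median for the median variant.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]  # leave the input untouched
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Donors: other rows where this attribute was observed.
            donors = sorted(
                (distance(row, other), other[col])
                for j, other in enumerate(rows)
                if j != i and other[col] is not None
            )[:k]
            if donors:
                completed[i][col] = aggregate([v for _, v in donors])
    return completed


# Illustrative values only; None marks an undetected peak.
spectra = [
    [1.0, 2.0, None],
    [1.1, 2.1, 10.0],
    [0.9, 1.9, 12.0],
    [8.0, 9.0, 50.0],
]
by_mean = knn_impute(spectra, k=3, aggregate=mean)
by_median = knn_impute(spectra, k=3, aggregate=median)
```

With k=3 the distant fourth row becomes a donor, so the mean is pulled toward its value while the median is not, which is one reason to compare both aggregation functions.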
45	文件距離為基礎kNN分群技術與新聞事件偵測追蹤之研究 / A study of relative text-distance-based kNN clustering technique and news events detection and tracking 陳柏均, Chen, Po Chun Unknown Date (has links) 新聞事件可描述為「一個時間區間內、同一主題的相似新聞之集合」，而新聞大多僅是一完整事件的零碎片段，其內容也易受到媒體立場或撰寫角度不同有所差異；除此之外，龐大的新聞量亦使得想要瞭解事件全貌的困難度大增。因此，本研究將利用文字探勘技術群聚相關新聞為事件，以增進新聞所帶來的價值。分類分群為文字探勘中很常見的步驟，亦是本研究將新聞群聚成事件所運用到的主要方法。最近鄰 (k-nearest neighbor, kNN)搜尋法可視為分類法中最常見的演算法之一，但由於kNN在分類上必須要每篇新聞兩兩比較並排序才得以選出最近鄰，這也產生了kNN在實作上的效能瓶頸。本研究提出了一個「建立距離參考基準點」的方法RTD-based kNN (Relative Text-Distance-based kNN)，透過在向量空間中建立一個基準點，讓所有文件利用與基準點的相對距離建立起遠近的關係，使得在選取前k個最近鄰之前，直接以相對關係篩選出較可能的候選文件，進而選出前k個最近鄰，透過相對距離的概念減少比較次數以改善效率。本研究於Google News中抽取62個事件(共742篇新聞)，並依其分群結果作為測試與評估依據，以比較RTD-based kNN與kNN新聞事件分群時的績效。實驗結果呈現出RTD-based kNN的基準點以常用字字彙建立較佳，分群後的再合併則有助於改善結果，而在RTD-based kNN與kNN的F-measure並無顯著差距(α=0.05)的情況下，RTD-based kNN的運算時間低於kNN達28.13%。顯示RTD-based kNN能提供新聞事件分群時一個更好的方法。最後，本研究提供一些未來研究之方向。 / A news event can be described as "the aggregation of many similar news articles that describe a particular incident within a specific timeframe". Most news articles portray only a fragment of a complete event, and their content is often biased by the standpoint of the media outlet or the viewpoint of the reporter; in addition, the massive volume of news makes it harder to grasp the whole event. Therefore, this research employs text mining techniques to cluster related news into events, thereby increasing the value that news provides. Classification and clustering are frequently used steps in text mining, and k-nearest neighbor (kNN) is one of the most common classification algorithms. However, kNN requires pairwise comparison and sorting of all articles to select the nearest neighbors, which becomes its performance bottleneck. This research proposes Relative Text-Distance-based kNN (RTD-based kNN). The core concept of this method is to establish a base, a distance reference point, in the vector space, so that all documents can establish their distance relationships through their relative distance to the base. 
Through the concept of relative distance, the more likely candidate documents are filtered before the k nearest neighbors are selected, which decreases the number of comparisons and improves efficiency. This research draws a sample of 62 events (742 news articles in total) from Google News for testing and evaluation. With no significant difference in F-measure between RTD-based kNN and kNN (α=0.05), the computation time of RTD-based kNN was 28.13% lower than that of kNN. This shows that RTD-based kNN is a better method for clustering news events. Finally, this research suggests some directions for future research. 文字探勘 kNN 事件偵測與追蹤分類分群 Text Mining kNN Events Detection and Tracking Classification and Clustering
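The reference-point idea in the record above can be illustrated with a small sketch: precompute every document's distance to a base point, then compare a query exactly only against documents whose base distance is close to its own. The abstract does not give the actual filtering rule, so the fixed-size window used here, the names build_index and rtd_knn, and the sample vectors are assumptions.

```python
from bisect import bisect_left
from math import dist


def build_index(vectors, base):
    """Sort document ids by their distance to the reference point."""
    keyed = sorted((dist(v, base), i) for i, v in enumerate(vectors))
    return [d for d, _ in keyed], [i for _, i in keyed]


def rtd_knn(query, vectors, base, index, k, window):
    """Approximate kNN (hypothetical sketch).

    Only a window of documents around the query's position in the
    base-distance ordering is compared exactly with the query.
    """
    base_dists, ids = index
    pos = bisect_left(base_dists, dist(query, base))
    lo = max(0, pos - window // 2)
    hi = min(len(ids), lo + window)
    lo = max(0, hi - window)  # keep the window full near the upper end
    candidates = ids[lo:hi]
    # Exact re-ranking is still needed: a similar base distance does not
    # guarantee that two documents are close to each other.
    ranked = sorted((dist(query, vectors[i]), i) for i in candidates)
    return [i for _, i in ranked[:k]]


# Illustrative 2-D vectors; real documents would be term-weight vectors.
vectors = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
           (10.0, 0.0), (11.0, 0.0), (0.0, 2.5)]
base = (0.0, 0.0)
index = build_index(vectors, base)
neighbors = rtd_knn((2.2, 0.0), vectors, base, index, k=2, window=4)
```

Here four exact comparisons replace six. The saving grows with collection size, at the price of possibly missing a true neighbor that falls outside the window.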
46	Neparametrinio kNN metodo taikymo miškų inventorizacijoje tyrimai / Investigation of the application of non – parametric kNN method for forest inventory Jonikavičius, Donatas 16 August 2007 (has links) Magistro darbe yra nagrinėjamos neparametrinio knn (k-nearest neighbor) metodo taikymo galimybės Lietuvos sąlygomis vertinant tradicinius miško taksacinius rodiklius bet kokiame šalies teritorijos taške. Darbo objektas – Dubravos miškų urėdijos Dubravos miškas. Darbo tikslas – įvertinti neparametrinio knn (k-nearest neighbor) metodo taikymo Lietuvos miškų inventorizacijose galimybes. Darbo rezultatai. Nustatyta, kad taksacinių rodiklių įvertinimo knn metodu tikslumas kyla didinant apskaitos vienetų, išmatuotų vietovėje, skaičių. Pagrindiniai knn metodo parametrai, kuriais gauti geriausi rezultatai, buvo: 10 artimiausių kaimynų (k reikšmė), atvirkščiai proporcingo atstumo schema, nusakant kiekvieno iš artimiausių kaimynų svertus. Papildomos pagalbinės informacijos – tradicinės sklypinės miškų inventorizacijos metu nustatytų medynų taksacinių rodiklių – panaudojimas kartu su kosminiais Spot Xi vaizdais padidina taksacinių rodiklių įvertinimo tikslumą. Pritaikius optimalų knn metodo taikymo taktikos variantą, mažiausios pasiektos taksacinių rodiklių nustatymo vidutinės kvadratinės paklaidos sudarė 27% medyno vidutinio skersmens, 20% vidutinio aukščio, 40% skerspločių sumos, 35% vidutinio amžiaus, 43% tūrio viename ha, 33% spygliuočių procento rodiklio. Pasitelkus 1999 metų Spot Xi kosminius vaizdus, 1986 apskaitos bareliuose išmatuotas pagrindines medynų taksacines charakteristikas bei 1988 metų sklypinės miškotvarkos duomenis, knn metodu nustatyti pagrindinių taksacinių... [toliau žr. visą tekstą] / The research is dealing with investigations of non-parametric knn (k-nearest neighbor) method for estimation of standard forest characteristics at any point of an area under Lithuanian conditions. 
Study object: Dubrava forest, managed by Dubrava experimental forest enterprise. Objective: to assess the usability of the non-parametric knn (k-nearest neighbor) method in Lithuanian forest inventories. Results. Increasing the number of sample plots with known field information was found to improve the estimation accuracy. The knn parameters that gave the best results were 10 nearest neighbors (the value of k) and an inverse distance weighting scheme for defining the weights of the selected neighbors. Integrating additional auxiliary information – characteristics of forest compartments, estimated during the conventional stand-wise inventory – together with Spot Xi images improved the overall accuracy of the estimations. When the optimal knn tactics were applied, the lowest root mean square errors achieved were 27% of the average value of all plots within the study area for mean diameter, 20% for mean height, 40% for basal area, 35% for mean age, 43% for volume per 1 ha and 33% for the percentage of coniferous species in the stand tree species composition. Spot Xi images from the year 1999, main forest characteristics from 1986 field measured sample plots and data of conventional stand-wise forest inventory from the year 1988 were utilized to estimate using knn method the... [to full text] Forestry Miškų inventorizacija Kosminiai vaizdai Neparametriniai metodai Medynų taksaciniai rodikliai Knn metodas Forest inventory Spatial images Non-parametric methods Forest stand characteristics Knn method
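The estimation scheme reported in the record above (k nearest reference plots, weights inversely proportional to distance, error expressed as RMSE relative to the mean) can be sketched as follows. The function names, the Euclidean distance in feature space, the eps guard against division by zero and the sample numbers are assumptions for illustration, not details from the thesis.

```python
from math import dist, sqrt


def knn_estimate(target_features, plots, k=10, eps=1e-9):
    """Inverse-distance-weighted kNN estimate of a stand characteristic.

    plots: list of (feature_vector, measured_value) reference plots,
    e.g. image band values paired with a field-measured volume.
    """
    nearest = sorted((dist(target_features, f), v) for f, v in plots)[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def relative_rmse(estimates, observed):
    """RMSE as a percentage of the mean observed value."""
    mse = sum((e - o) ** 2 for e, o in zip(estimates, observed)) / len(observed)
    return 100.0 * sqrt(mse) / (sum(observed) / len(observed))


# Illustrative one-band example (values are made up).
plots = [((0.0,), 10.0), ((2.0,), 20.0), ((100.0,), 500.0)]
estimate = knn_estimate((1.0,), plots, k=2)
error_pct = relative_rmse([11.0, 19.0], [10.0, 20.0])
```

The two nearest plots are equally distant from the target, so they receive equal weight and the estimate is their average; the far plot is excluded by k=2.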
47	應用文字探勘分析網路團購商品群集之研究－以美食類商品為例 / The study of analyzing group-buying goods clusters by using text mining – exemplified by the group-buying foods 趙婉婷 Unknown Date (has links) 網路團購消費模式掀起一陣風潮，隨著網路團購市場接受度提高，現今以團購方式進行購物的消費模式不斷增加，團購商品品項也日益繁多。為了使網路團購消費者更容易找到感興趣的團購商品，本研究將針對團購商品進行群集分析。本研究以國內知名團購網站「愛合購」為例，以甜點蛋糕分類下的熱門美食團購商品為主，依商品名稱找尋該商品的顧客團購網誌文章納入資料庫中。本研究從熱門度前1000項的產品中找到268項產品擁有顧客團購網誌586篇，透過文字探勘技術從中擷取產品特徵相關資訊，並以「ｋ最近鄰居法」為基礎建置kNN分群器，以進行群集分析。本研究依不同的k值以及分群門檻值進行分群，並對大群集進行階段式分群，單項群集進行質心合併，以尋求較佳之分群結果。研究結果顯示，268項團購商品經過kNN分群器進行四個階段的群集分析後可獲得28個群集，群內相似度從未分群時的0.029834提升至0.177428。在經過第一階段的分群後，可將商品分為3個主要大群集，即「麵包類」、「蛋糕類」以及「其他口感類」。在進行完四個階段的分群後，「麵包類」可分為2種類型的群集，即『麵包類產品』以及『擁有麵包特質的產品』，而「蛋糕類」則是可依口味區分為不同的蛋糕群集。產品重要特徵詞彙不像一般文章的關鍵字詞會重複出現於文章中，因此在特徵詞彙過濾時應避免刪減過多的產品特徵詞彙。群集特性可由詞彙權重前20%之詞彙依人工過濾及商品出現頻率挑選出產品特徵代表詞來做描繪。研究所獲得之分群結果除了提供團購消費者選擇產品時參考外，也可幫助團購網站業者規劃更適切的行銷活動。本研究亦提出一些未來研究方向。 / Group-buying has become prevalent, and the range of group-buying merchandise has grown increasingly diverse. To help consumers find the commodities they are interested in, this research performs a cluster analysis of group-buying products, clustering them by their features. We collect customers' blog posts about the products, retrieve the product features via text mining, and then build a kNN-based clusterer to cluster them. This research tests different threshold values, clusters large groups in stages, and merges small (single-item) groups by centroid, seeking the best clustering quality. The results show that 268 group-buying food items can be divided into 28 clusters, and that the mean intra-cluster similarity improves. The 28 clusters can be categorized into three main clusters: Bread, Cake, and Other mouthfeel foods. Each cluster can be characterized and named from the top twenty percent of its keywords by weight. The results of this study can help buyers find similar commodities that they like, and can also help sellers plan more suitable marketing activities. 
文字探勘團購最近鄰居法 kNN分群 Text Mining Group-buying k-Nearest Neighbors kNN clustering
48	Word Embeddings in Database Systems Günther, Michael 18 November 2021 (has links) Research in natural language processing (NLP) has recently focused on the development of learned language models called word embedding models, such as word2vec, fastText, and BERT. Pre-trained on large amounts of unstructured text in natural language, those embedding models constitute a rich source of common knowledge in the domain of the text used for the training. In the NLP community, significant improvements are achieved by using those models together with deep neural network models. To help applications benefit from word embeddings, we extend the capabilities of traditional relational database systems, which are still by far the most common DBMSs but only provide limited text analysis features. Therefore, we implement (a) novel database operations involving embedding representations to allow a database user to exploit the knowledge encoded in word embedding models for advanced text analysis operations. The integration of those operations into the database query language enables users to construct queries using novel word embedding operations in conjunction with the traditional query capabilities of SQL. To allow efficient retrieval of embedding representations and fast execution of the operations, we implement (b) novel search algorithms and index structures for approximate kNN-Joins and integrate those into a relational database management system. Moreover, we investigate techniques to optimize embedding representations of text values in database systems. Therefore, we design (c) a novel context adaptation algorithm. This algorithm utilizes the structured data present in the database to enrich the embedding representations of text values to model their context-specific semantics in the database. In addition, we provide (d) support for selecting a word embedding model suitable for a user's application. 
Therefore, we developed a data processing pipeline to construct a dataset for domain-specific word embedding evaluation. Finally, we propose (e) novel embedding techniques for pre-training on tabular data to support applications working with text values in tables. Our proposed embedding techniques model semantic relations arising from the alignment of words in tabular layouts that can only hardly be derived from text documents, e.g., relations between table schema and table body. In this way, many applications, which either employ embeddings in supervised machine learning models, e.g., to classify cells in spreadsheets, or through the application of arithmetic operations, e.g., table discovery applications, can profit from the proposed embedding techniques.:1 INTRODUCTION 1.1 Contribution 1.2 Outline 2 REPRESENTATION OF TEXT FOR NATURAL LANGUAGE PROCESSING 2.1 Natural Language Processing Systems 2.2 Word Embedding Models 2.2.1 Matrix Factorization Methods 2.2.2 Learned Distributed Representations 2.2.3 Contextualize Word Embeddings 2.2.4 Advantages of Contextualize and Static Word Embeddings 2.2.5 Properties of Static Word Embeddings 2.2.6 Node Embeddings 2.2.7 Non-Euclidean Embedding Techniques 2.3 Evaluation of Word Embeddings 2.3.1 Similarity Evaluation 2.3.2 Analogy Evaluation 2.3.3 Cluster-based Evaluation 2.4 Application for Tabular Data 2.4.1 Semantic Search 2.4.2 Data Curation 2.4.3 Data Discovery 3 SYSTEM OVERVIEW 3.1 Opportunities of an Integration 3.2 Characteristics of Word Vectors 3.3 Objectives and Challenges 3.4 Word Embedding Operations 3.5 Performance Optimization of Operations 3.6 Context Adaptation 3.7 Requirements for Model Recommendation 3.8 Tabular Embedding Models 4 MANAGEMENT OF EMBEDDING REPRESENTATIONS IN DATABASE SYSTEMS 4.1 Integration of Operations in an RDBMS 4.1.1 System Architecture 4.1.2 Storage Formats 4.1.3 User-Defined Functions 4.1.4 Web Application 4.2 Nearest Neighbor Search 4.2.1 Tree-based Methods 4.2.2 Proximity Graphs 
4.2.3 Locality-Sensitive Hashing 4.2.4 Quantization Techniques 4.3 Applicability of ANN Techniques for Word Embedding kNN-Joins 4.4 Related Work on kNN Search in Database Systems 4.5 ANN-Joins for Relational Database Systems 4.5.1 Index Architecture 4.5.2 Search Algorithm 4.5.3 Distance Calculation 4.5.4 Optimization Capabilities 4.5.5 Estimation of the Number of Targets 4.5.6 Flexible Product Quantization 4.5.7 Further Optimizations 4.5.8 Parameter Tuning 4.5.9 kNN-Joins for Word2Bits 4.6 Evaluation 4.6.1 Experimental Setup 4.6.2 Influence of Index Parameters on Precision and Execution Time 4.6.3 Performance of Subroutines 4.6.4 Flexible Product Quantization 4.6.5 Accuracy of the Target Size Estimation 4.6.6 Performance of Word2Bits kNN-Join 4.7 Summary
5 CONTEXT ADAPTATION FOR WORD EMBEDDING OPTIMIZATION 5.1 Related Work 5.1.1 Graph and Text Joint Embedding Methods 5.1.2 Retrofitting Approaches 5.1.3 Table Embedding Models 5.2 Relational Retrofitting Approach 5.2.1 Data Preparation 5.2.2 Relational Retrofitting Problem 5.2.3 Relational Retrofitting Algorithm 5.2.4 Online-RETRO 5.3 Evaluation Platform: Retro Live 5.3.1 Functionality 5.3.2 Interface 5.4 Evaluation 5.4.1 Datasets 5.4.2 Training of Embeddings 5.4.3 Machine Learning Models 5.4.4 Evaluation of ML Models 5.4.5 Run-time Measurements 5.4.6 Online Retrofitting 5.5 Summary
6 MODEL RECOMMENDATION 6.1 Related Work 6.1.1 Extrinsic Evaluation 6.1.2 Intrinsic Evaluation 6.2 Architecture of FacetE 6.3 Evaluation Dataset Construction Pipeline 6.3.1 Web Table Filtering and Facet Candidate Generation 6.3.2 Check Soft Functional Dependencies 6.3.3 Post-Filtering 6.3.4 Categorization 6.4 Evaluation of Popular Word Embedding Models 6.4.1 Domain-Agnostic Evaluation 6.4.2 Evaluation of a Single Facet 6.4.3 Evaluation of an Object Set 6.5 Summary
7 TABULAR TEXT EMBEDDINGS 7.1 Related Work 7.1.1 Static Table Embedding Models 7.1.2 Contextualized Table Embedding Models 7.2 Web Table Embedding Model 7.2.1 Preprocessing 7.2.2 Text Serialization 7.2.3 Encoding Model 7.2.4 Embedding Training 7.3 Applications for Table Embeddings 7.3.1 Table Union Search 7.3.2 Classification Tasks 7.4 Evaluation 7.4.1 Intrinsic Evaluation 7.4.2 Table Union Search Evaluation 7.4.3 Table Layout Classification 7.4.4 Spreadsheet Cell Classification 7.5 Summary
8 CONCLUSION 8.1 Summary 8.2 Directions for Future Work
BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLES A CONVEXITY OF RELATIONAL RETROFITTING B EVALUATION OF THE RELATIONAL RETROFITTING HYPERPARAMETERS
info:eu-repo/classification/ddc/004 ddc:004
49	Développement de couches minces ferroélectriques sans plomb et intégration dans des antennes miniatures reconfigurables / Elaboration of lead-free ferroelectric thin films and their integration in tunable miniature antennas working at microwave frequencies Aspe, Barthélémy 08 October 2019 (has links) Integrating ferroelectric oxides reduces the size of electronic devices for telecommunication applications while also making them reconfigurable. Among these multifunctional materials, KxNa1-xNbO3 (KNN) is a promising lead-free oxide for a large number of applications. The main goal of this work is to prepare KNN thin films and to study their dielectric properties at microwave frequencies, with a view to integrating them into miniature reconfigurable antennas. The permittivity εr, the loss tanδ and the tunability were characterised on KNN thin films deposited by pulsed laser deposition. Progress on the deposition of KNN thin films by RF magnetron sputtering is also presented. Following a study of the material composition, an investigation of how the structural properties influence the dielectric properties, using two types of substrate, yielded a tunability of 20% under a 90 kV/cm bias field for x = 0.5. A temperature-dependent dielectric characterisation at 10 GHz showed the permittivity increasing from 360 at 20°C to 1000 at 240°C, at the polymorphic phase transition. The tetragonal tungsten bronze (TTB) phase, still barely studied in the K-Na-Nb-O system, was prepared as highly oriented thin films and showed high permittivity at both low and microwave frequencies (~200 at 10 kHz and ~130 at 10 GHz). Finally, miniature antennas integrating KNN were designed, fabricated and measured.
Thin films; Pulsed laser deposition; Tunable device; Microwave frequencies; Miniature antenna; KxNa1-xNbO3 (KNN)
50	Applicering av maskininlärning för att predicera utfall av Kickstarter-projekt / Application of machine learning to predict outcome of Kickstarter-projects Lidén, Rickard, In, Gabriel January 2021 (has links) Crowdfunding has become a popular way to raise money for a project in the modern digital world. Kickstarter is one of the leading crowdfunding sites. Predicting the success or failure of a Kickstarter project with machine learning could therefore be of great interest to entrepreneurs. The purpose of this study is to compare the predictive ability of four algorithms on two Kickstarter datasets. One dataset covers the years 2020-2021 and the other covers 2016-2021. The algorithms compared are KNN, Naive Bayes, MLP and Random Forest. Of these four, KNN and Random Forest produced the best models. KNN performed best on the 2020-2021 dataset, with 77.0% accuracy. Random Forest performed best on the 2016-2021 dataset, with 76.8% accuracy. The language used in this study is Swedish.
Crowdfunding; Machine learning; Random Forest; Multilayer Perceptron; KNN; Naive Bayes; Computer and Information Sciences
