81

Host-pathogen interactions and evolution of epitopes in HIV-1: understanding selection and escape

Paul, Sinu 16 April 2012 (has links)
No description available.
82

pcApriori: Scalable apriori for multiprocessor systems

Schlegel, Benjamin, Kiefer, Tim, Kissinger, Thomas, Lehner, Wolfgang 16 September 2022 (has links)
Frequent-itemset mining is an important part of data mining. It is a computation- and memory-intensive task with a large number of scientific and statistical application areas. In many of them, the datasets can easily grow to tens or even several hundred gigabytes of data, so efficient algorithms are required to process such amounts of data. In recent years, many efficient sequential mining algorithms have been proposed, but they cannot exploit current and future systems that provide large degrees of parallelism. In contrast, the number of parallel frequent-itemset mining algorithms is rather small, and most of them do not scale well as the number of threads increases. In this paper, we present a highly scalable mining algorithm that is based on the well-known Apriori algorithm; it is optimized for processing very large datasets on multiprocessor systems. The key idea of pcApriori is to employ a modified producer-consumer processing scheme, which partitions the data during processing and distributes it to the available threads. We conduct many experiments on large datasets; pcApriori scales almost linearly on our test system comprising 32 cores.
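As a rough illustration of the producer-consumer idea named in the abstract (this is not the authors' pcApriori; the identifiers such as count_partition and MIN_SUPPORT, the naive candidate generation, and the toy data are all assumptions), an Apriori-style counting loop that feeds data partitions to worker threads through a queue might look like this:

```python
# Hypothetical sketch: Apriori-style level-wise counting driven by a
# producer-consumer queue over data partitions. Illustrative only.
from collections import Counter
from queue import Queue
from threading import Thread

MIN_SUPPORT = 2          # absolute support threshold (assumed)
NUM_WORKERS = 4
SENTINEL = None

def count_partition(partition, candidates):
    """Count how often each candidate itemset occurs in one partition."""
    counts = Counter()
    for transaction in partition:
        tset = set(transaction)
        for cand in candidates:
            if cand <= tset:
                counts[cand] += 1
    return counts

def worker(queue, candidates, results):
    while True:
        partition = queue.get()
        if partition is SENTINEL:
            break
        results.append(count_partition(partition, candidates))

def frequent_itemsets(partitions, items):
    frequent = []
    candidates = [frozenset([i]) for i in items]
    while candidates:
        queue, results = Queue(), []
        threads = [Thread(target=worker, args=(queue, candidates, results))
                   for _ in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for part in partitions:          # producer: hand partitions to consumers
            queue.put(part)
        for _ in threads:
            queue.put(SENTINEL)
        for t in threads:
            t.join()
        total = Counter()
        for counts in results:
            total.update(counts)
        level = [c for c, n in total.items() if n >= MIN_SUPPORT]
        frequent.extend(level)
        # naive candidate generation: join frequent k-itemsets into (k+1)-itemsets
        candidates = list({a | b for a in level for b in level
                           if len(a | b) == len(a) + 1})
    return frequent

baskets = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
partitions = [baskets[:2], baskets[2:]]  # two toy partitions
print(frequent_itemsets(partitions, {"a", "b", "c", "d"}))
```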
83

Scalable frequent itemset mining on many-core processors

Schlegel, Benjamin, Karnagel, Thomas, Kiefer, Tim, Lehner, Wolfgang 19 September 2022 (has links)
Frequent-itemset mining is an essential part of the association rule mining process, which has many application areas. It is a computation- and memory-intensive task with many opportunities for optimization. Many efficient sequential and parallel algorithms have been proposed in recent years. Most of the parallel algorithms, however, cannot cope with the huge number of threads provided by large multiprocessor or many-core systems. In this paper, we provide a highly parallel version of the well-known Eclat algorithm. It runs on both multiprocessor systems and many-core coprocessors, and it scales well up to a very large number of threads (244 in our experiments). To evaluate mcEclat's performance, we conducted many experiments on realistic datasets. mcEclat achieves high speedups of up to 11.5x and 100x on a 12-core multiprocessor system and a 61-core Xeon Phi many-core coprocessor, respectively. Furthermore, mcEclat is competitive with highly optimized existing frequent-itemset mining implementations taken from the FIMI repository.
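For readers unfamiliar with the Eclat family the abstract builds on, here is a minimal sequential sketch of the vertical (tidset-intersection) depth-first search. It is not the authors' mcEclat and omits all parallelization, which is the paper's actual contribution; the toy data is assumed.

```python
# Minimal sequential Eclat sketch: vertical tidsets and depth-first
# intersection. Illustrative only; mcEclat adds the parallelization.
from collections import defaultdict

def eclat(transactions, min_support):
    # Vertical layout: item -> set of transaction ids (tidset).
    tidsets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets[item].add(tid)

    frequent = {}

    def recurse(prefix, items):
        # items: (item, tidset) pairs that are frequent together with `prefix`
        for i, (item, tids) in enumerate(items):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            suffix = []
            for other, other_tids in items[i + 1:]:
                shared = tids & other_tids           # tidset intersection
                if len(shared) >= min_support:
                    suffix.append((other, shared))
            if suffix:
                recurse(itemset, suffix)             # depth-first extension

    roots = [(item, tids)
             for item, tids in sorted(tidsets.items(), key=lambda kv: kv[0])
             if len(tids) >= min_support]
    recurse((), roots)
    return frequent

baskets = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
for itemset, support in eclat(baskets, min_support=2).items():
    print(itemset, support)
```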
84

應用資料採礦於連鎖藥局商品 / The Application of Data Mining on the Association of Pharmaceutical Products Through the Chain Pharmacies

王詠立, Wang YungLi Unknown Date (has links)
台灣連鎖藥通路已逐漸轉型為複合式藥局，除了購買處方藥以外，現今藥局銷售商品種類眾多且逐漸成為社區商店之型態，讓民眾一次性購足藥品、化妝品、食品及生活日用品等。顧客於門市消費後累積了大筆的顧客會員銷售資料，本研究結合大數據資料採礦之技術，應用在顧客購買行為與行銷策略之間的相應關係，並藉此了解顧客在藥局通路的消費型態，進而衍生出符合顧客需求的行銷組合方案。 本研究藉由台灣某家連鎖藥局的銷售時點情報系統(POINT OF SALE, POS)資料分析顧客會員之購買行為，依據會員之購買日期、購買品項、購買金額等，應用資料採礦分析方法，先利用RFM模型分析顧客價值群的特性概況，再利用APRIORI演算法針對該連鎖藥局的八大類別商品銷售資料探討顧客會員購買產品的關聯規則，依照結果衍生出不同的商品銷售組合，並在門市執行有效的行銷策略以提升營業額。最後，依據研究結果對該家連鎖藥局提供銷售的策略及建議，作為該連鎖藥局業者後續經營之參考。 / The development of domestic pharmacies in Taiwan has been influenced by government policy and logistics. Traditional pharmacies have gradually been replaced by chain pharmacies to meet demands for product variety, customization, and consumer information. This study analyzes members' purchase behaviors using Point of Sale (POS) data from a chain pharmacy's headquarters, based on purchase date, purchased items, purchase amount, and so on. First, customer purchasing records are collected and the widely known RFM model is used to evaluate the value of each customer. Second, an association rule mining tool, the Apriori algorithm, is used to analyze the relationship between customer purchasing records and products and to obtain hidden, useful purchasing rules for each product category. The resulting association rules can help decision makers plan new cross-selling strategies for products in the future.
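As a concrete illustration of the RFM step described above (the column names, the toy data, and the 1-5 quintile scoring are assumptions for this sketch, not the thesis's actual scheme), a pandas version might look like this:

```python
# Hypothetical RFM scoring sketch; column names, toy data, and the 1-5
# quintile scores are assumptions, not the thesis's actual scheme.
import pandas as pd

sales = pd.DataFrame({
    "member_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2016-01-05", "2016-03-01", "2016-02-10",
                            "2016-01-20", "2016-02-25", "2016-03-15"]),
    "amount": [200, 350, 120, 80, 60, 400],
})

snapshot = sales["date"].max() + pd.Timedelta(days=1)
rfm = sales.groupby("member_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

# Rank-based 1-5 scores: low recency is good, high frequency/monetary is good.
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 5, labels=list(range(5, 0, -1)))
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=list(range(1, 6)))
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=list(range(1, 6)))
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```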
85

高效率常見超集合探勘演算法之研究 / Efficient Algorithms for the Discovery of Frequent Superset

廖忠訓, Liao, Zhung-Xun Unknown Date (has links)
過去對於探勘常見項目集的研究僅限於找出資料庫中交易紀錄的子集合，在這篇論文中，我們提出一個新的探勘主題：常見超集合探勘。常見超集合意指它包含資料庫中各筆紀錄的筆數多於最小門檻值，而原本用來探勘常見子集合的演算法並無法直接套用，因此我們以補集合的角度，提出了三個快速的演算法來解決這個新的問題。首先為Apriori-C：此為使用先廣後深搜尋的演算法，並且以掃描資料庫的方式來決定具有相同長度之候選超集合的支持度，第二個方法是Eclat-C：此為採用先深後廣搜尋的演算法，並且搭配交集法來計算候選超集合的支持度，最後是DCT：此方法可利用過去常見子集合探勘的演算法來進行探勘，如此可以省下開發新系統的成本。 常見超集合的探勘可以應用在電子化的遠距學習系統，生物資訊及工作排程的問題上。尤其在線上學習系統，我們可以利用常見超集合來代表一群學生的學習行為，並且藉以預測學生的學習成就，使得老師可以及時發現學生的學習迷失等行為；此外，透過常見超集合的探勘，我們也可以為學生推薦個人化的課程，以達到因材施教的教學目標。 在實驗的部份，我們比較了各演算法的效率，並且分別改變實驗資料庫的下列四種變因：1) 交易資料的筆數、2) 每筆交易資料的平均長度、3) 資料庫中項目的總數和4) 最小門檻值。在最後的分析當中，可以清楚地看出我們提出的各種方法皆十分有效率並且具有可延伸性。 / Algorithms for the discovery of frequent itemsets have been investigated widely; these frequent itemsets are subsets of the database transactions. In this thesis, we propose a novel mining task: mining frequent supersets from a database of itemsets, which is useful in bioinformatics, e-learning systems, job-shop scheduling, and so on. A frequent superset is an itemset that contains (as subsets) at least as many database transactions as the minimum support threshold. Intuitively, a level-wise Apriori-style search would start from 1-itemsets, then 2-itemsets, and so forth. However, such steps cannot use the Apriori property to reduce the search space, because an itemset that is not frequent may still have frequent supersets. To solve this problem, we propose three methods. The first is the Apriori-based approach, called Apriori-C. The second is the Eclat-based approach, called Eclat-C, which is a depth-first approach. The last is the proposed data complement technique (DCT), which lets an existing frequent-itemset mining algorithm discover frequent supersets. The experimental studies compare the performance of the three proposed methods while varying the number of transactions, the average transaction length, the number of different items, and the minimum support. The analysis shows that the proposed algorithms are time efficient and scalable.
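The complement idea behind DCT can be sketched directly: a transaction T is contained in a superset X exactly when the complement of X is contained in the complement of T, so mining ordinary frequent itemsets on complemented transactions and complementing the results yields frequent supersets. The toy sketch below (with a brute-force miner standing in for a real frequent-itemset algorithm) reflects one reading of the abstract, not the thesis's actual implementation.

```python
# Sketch of the complement idea behind DCT: a superset X is "frequent" when at
# least min_support transactions are subsets of X, and T <= X holds exactly
# when complement(X) <= complement(T). So mining ordinary frequent itemsets on
# complemented transactions and complementing the results yields frequent
# supersets. The brute-force miner is only a stand-in for a real algorithm.
from itertools import combinations

def brute_force_frequent_itemsets(transactions, universe, min_support):
    frequent = []
    for k in range(0, len(universe) + 1):
        for cand in combinations(sorted(universe), k):
            cset = set(cand)
            if sum(cset <= t for t in transactions) >= min_support:
                frequent.append(cset)
    return frequent

def frequent_supersets(transactions, universe, min_support):
    complemented = [universe - t for t in transactions]
    return [universe - f
            for f in brute_force_frequent_itemsets(complemented, universe, min_support)]

universe = {"a", "b", "c", "d"}
db = [{"a"}, {"a", "b"}, {"c"}, {"a", "b", "c"}]
for s in frequent_supersets(db, universe, min_support=2):
    print(sorted(s))
```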
86

Méthode d'analyse de données pour le diagnostic a posteriori de défauts de production - Application au secteur de la microélectronique / A post-hoc Data Mining method for defect diagnosis - Application to the microelectronics sector

Yahyaoui, Hasna 21 October 2015 (has links)
La maîtrise du rendement d’un site de fabrication et l’identification rapide des causes de perte de qualité restent un défi quotidien pour les industriels, qui font face à une concurrence continue. Dans ce cadre, cette thèse a pour ambition de proposer une démarche d’analyse permettant l’identification rapide de l’origine d’un défaut, à travers l’exploitation d’un maximum des données disponibles grâce aux outils de contrôle qualité, tel que la FDC, la métrologie, les tests paramétriques PT, et le tri électriques EWS. Nous avons proposé une nouvelle méthode hybride de fouille de données, nommée CLARIF, qui combine trois méthodes de fouille de données à savoir, le clustering, les règles d’association et l’induction d’arbres de décision. Cette méthode se base sur la génération non supervisée d’un ensemble de modes de production potentiellement problématiques, qui sont caractérisés par des conditions particulières de production. Elle permet, donc, une analyse qui descend au niveau des paramètres de fonctionnement des équipements. L’originalité de la méthode consiste dans (1) une étape de prétraitement pour l’identification de motifs spatiaux à partir des données de contrôle, (2) la génération non supervisée de modes de production candidats pour expliquer le défaut. Nous optimisons la génération des règles d’association à travers la proposition de l’algorithme ARCI, qui est une adaptation du célèbre algorithme de fouille de règles d’association, APRIORI, afin de permettre d’intégrer les contraintes spécifiques à la problématique de CLARIF, et des indicateurs de qualité de filtrage des règles à identifier, à savoir la confiance, la contribution et la complexité. Finalement, nous avons défini un processus d’Extraction de Connaissances à partir des Données, ECD permettant de guider l’utilisateur dans l’application de CLARIF pour expliquer une perte de qualité locale ou globale. / Controlling the yield of a manufacturing site and rapidly identifying the causes of quality loss remain a daily challenge for manufacturers, who face continuous competition. In this context, this thesis proposes an analytical approach for the rapid identification of defect origins by exploiting as much as possible of the data made available by quality control systems such as FDC, metrology, parametric tests (PT), and electrical wafer sorting (EWS). The proposed method, named CLARIF, combines three complementary data mining techniques: clustering, association rules, and decision tree induction. The method is based on the unsupervised generation of a set of potentially problematic production modes, which are characterized by specific manufacturing conditions; it therefore supports an analysis that goes down to the level of equipment operating parameters. The originality of the method lies in (1) a pre-processing step that identifies spatial patterns in quality control data, and (2) the unsupervised generation of candidate production modes to explain the quality loss. We optimize the generation of association rules with the ARCI algorithm, an adaptation of the well-known association rule mining algorithm APRIORI that integrates the constraints specific to our problem together with quality indicators for filtering the rules to identify, namely confidence, contribution, and complexity. Finally, we define a Knowledge Discovery in Databases (KDD) process that guides the user in applying CLARIF to explain local or global quality losses.
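The abstract names confidence, contribution, and complexity as rule-filtering indicators but does not define them; the sketch below applies only the two standard ones (confidence, and antecedent length as a proxy for complexity) to candidate rules of the form conditions -> defect. The condition encoding, thresholds, and toy data are assumptions, and this is not the ARCI algorithm itself.

```python
# Hypothetical post-filtering of candidate association rules in the spirit of
# the CLARIF/ARCI description above: keep rules "conditions -> defect" with
# high confidence and a short antecedent. "Contribution" is not defined in
# the abstract and is therefore omitted here.
def filter_rules(transactions, candidate_antecedents, target="defect",
                 min_confidence=0.8, max_complexity=3):
    kept = []
    for antecedent in candidate_antecedents:
        ant = frozenset(antecedent)
        if len(ant) > max_complexity:        # complexity filter: antecedent length
            continue
        covered = [t for t in transactions if ant <= t]
        if not covered:
            continue
        confidence = sum(target in t for t in covered) / len(covered)
        if confidence >= min_confidence:
            kept.append((sorted(ant), round(confidence, 2), len(covered)))
    return kept

# Toy lots: each transaction is a set of discretized equipment conditions plus
# an optional "defect" marker.
lots = [
    {"tool=A", "pressure=high", "defect"},
    {"tool=A", "pressure=high", "defect"},
    {"tool=A", "pressure=low"},
    {"tool=B", "pressure=high"},
]
candidates = [{"tool=A"}, {"pressure=high"}, {"tool=A", "pressure=high"}]
print(filter_rules(lots, candidates))   # only the two-condition rule survives
```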
87

資料挖掘應用於入口網站之顧客關係管理—以國內某網站為例 / Application of Data Mining Techniques to Portal Site's Customer Relationship Management: A Case Study of Taiwan's Portal Site

柯淑貞, Ko, Shu-Chen Unknown Date (has links)
處在變化快速的網路環境中，入口網站如何建立起專屬的會員制度，以期行銷人員能在大量的會員資料庫中找出有用的資訊，掌握會員的網路行為模式、實現個人化之服務、有效區隔市場及瞭解不同會員之網路行為模式等，進而以制定適當之行銷策略而達成結合實體行銷之目標。而資料挖掘的技術能在資料量龐大的會員交易資料庫中，利用會員的基本資料與交易資料衍生建立相關的評估指標，以評估會員的特質、需求模型、消費特徵、建立市場區隔的行銷策略等，行銷人員藉此可採用不同的宣傳方式與促銷策略，以達最佳的獲利結果。 本研究以國內某入口網站真實之會員基本資料及入口網站之商品：BBS頻道與財經頻道的資料檔，做為會員網路行為模式之資料分析的基礎。本研究利用資料挖掘的技術，找出入口網站的會員與商品之分群特徵，並發掘會員在兩頻道間的網路行為的關聯規則。另一方面，本研究利用關聯規則演算法，考量實際在發掘關聯規則分析所碰到的問題，實作出一套操作流程式較為簡便的關聯規則分析程式。本研究提供不同的關聯規則分析角度，以考量會員購買商品項目組合的關聯規則，進而支援決策者制定相關商品的促銷決策，以提高銷售量。 / In a rapidly changing online environment, a portal site must establish its own membership mechanism so that marketers can extract useful information from a large member database, understand members' online behavior patterns, deliver personalized services, and segment the market effectively. Data mining techniques can use members' basic information and transaction data in a high-volume transaction database to derive evaluation indicators for assessing members' traits, demand models, and consumption characteristics, and to build market-segmentation strategies; marketers can then adopt different advertising and promotion strategies to achieve the best profit. This research is based on real member data and product data (the BBS channel and the finance channel) of a portal site in Taiwan, used as the basis for analyzing members' online behavior patterns. Using data mining techniques, we identify clustering characteristics of the portal's members and products and uncover association rules between members' behavior on the two channels. In addition, taking into account the practical problems encountered in association rule analysis, this research implements a simpler association rule analysis program based on an association rule algorithm. The research offers different angles of association rule analysis, considering rules over the combinations of items members purchase, and thereby supports decision makers in formulating promotion decisions for related products to raise sales volume.
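A minimal sketch of the member-clustering step mentioned above, assuming scikit-learn's KMeans and a made-up two-feature representation (visit counts on the BBS and finance channels); the thesis does not specify its actual clustering setup.

```python
# Hypothetical member-segmentation sketch; the features, toy data, and the use
# of scikit-learn's KMeans are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

# rows = members, columns = [BBS visits, finance-channel visits]
usage = np.array([
    [40,  2], [35,  5], [50,  1],    # heavy BBS users
    [ 3, 30], [ 5, 28],              # heavy finance users
    [20, 18], [22, 25],              # mixed usage
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(usage)
for label, center in enumerate(model.cluster_centers_):
    members = np.where(model.labels_ == label)[0].tolist()
    print(f"cluster {label}: members {members}, mean usage {center.round(1)}")
```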
88

Covering or complete? : Discovering conditional inclusion dependencies

Bauckmann, Jana, Abedjan, Ziawasch, Leser, Ulf, Müller, Heiko, Naumann, Felix January 2012 (has links)
Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In the last years conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case. / Datenabhängigkeiten (wie zum Beispiel Integritätsbedingungen), werden verwendet, um die Qualität eines Datenbankschemas zu erhöhen, um Anfragen zu optimieren und um Konsistenz in einer Datenbank sicherzustellen. In den letzten Jahren wurden bedingte Abhängigkeiten (conditional dependencies) vorgestellt, die die Qualität von Daten analysieren und verbessern sollen. Eine bedingte Abhängigkeit ist eine Abhängigkeit mit begrenztem Gültigkeitsbereich, der über Bedingungen auf einem oder mehreren Attributen definiert wird. In diesem Bericht betrachten wir bedingte Inklusionsabhängigkeiten (conditional inclusion dependencies; CINDs). Wir generalisieren die Definition von CINDs anhand der Unterscheidung von überdeckenden (covering) und vollständigen (completeness) Bedingungen. Wir stellen einen Anwendungsfall für solche CINDs vor, der den Nutzen von CINDs bei der Lösung komplexer Datenqualitätsprobleme aufzeigt. Darüber hinaus definieren wir Qualitätsmaße für Bedingungen basierend auf Sensitivität und Genauigkeit. Wir stellen effiziente Algorithmen vor, die überdeckende und vollständige Bedingungen innerhalb vorgegebener Schwellwerte finden. Unsere Algorithmen wählen nicht nur die Werte der Bedingungen, sondern finden auch die Bedingungsattribute automatisch. Abschließend zeigen wir, dass unser Ansatz effizient sinnvolle und hilfreiche Ergebnisse für den vorgestellten Anwendungsfall liefert.
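To make the covering/completeness distinction concrete, here is a toy sketch for a conditional inclusion dependency R.fk ⊆ S.key; the relations, attribute names, and the precision- and recall-style formulas are illustrative assumptions rather than the paper's exact definitions.

```python
# Toy sketch of condition quality for a conditional inclusion dependency
# R.fk ⊆ S.key: a condition selects R-tuples, and we ask how "covering"
# (precision-like: selected tuples that satisfy the inclusion) and how
# "complete" (recall-like: satisfying tuples that are selected) it is.
# Relation contents, attribute names, and formulas are illustrative only.

S_keys = {"p1", "p2", "p3"}

R = [
    {"fk": "p1", "type": "book"},
    {"fk": "p2", "type": "book"},
    {"fk": "x9", "type": "misc"},
    {"fk": "p3", "type": "dvd"},
]

def condition_quality(rows, condition, referenced_keys):
    selected = [r for r in rows if condition(r)]
    included = [r for r in rows if r["fk"] in referenced_keys]
    sel_and_inc = [r for r in selected if r["fk"] in referenced_keys]
    covering = len(sel_and_inc) / len(selected) if selected else 0.0
    completeness = len(sel_and_inc) / len(included) if included else 0.0
    return covering, completeness

cov, comp = condition_quality(R, lambda r: r["type"] == "book", S_keys)
print(f"type='book': covering={cov:.2f}, completeness={comp:.2f}")
# covering=1.00 (every selected tuple satisfies the inclusion),
# completeness=0.67 (the condition misses the included 'dvd' tuple).
```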
89

Datenzentrierte Bestimmung von Assoziationsregeln in parallelen Datenbankarchitekturen

Legler, Thomas 15 August 2009 (has links) (PDF)
Die folgende Arbeit befasst sich mit der Alltagstauglichkeit moderner Massendatenverarbeitung, insbesondere mit dem Problem der Assoziationsregelanalyse. Vorhandene Datenmengen wachsen stark an, aber deren Auswertung ist für ungeübte Anwender schwierig. Daher verzichten Unternehmen auf Informationen, welche prinzipiell vorhanden sind. Assoziationsregeln zeigen in diesen Daten Abhängigkeiten zwischen den Elementen eines Datenbestandes, beispielsweise zwischen verkauften Produkten. Diese Regeln können mit Interessantheitsmaßen versehen werden, welche dem Anwender das Erkennen wichtiger Zusammenhänge ermöglichen. Es werden Ansätze gezeigt, dem Nutzer die Auswertung der Daten zu erleichtern. Das betrifft sowohl die robuste Arbeitsweise der Verfahren als auch die einfache Auswertung der Regeln. Die vorgestellten Algorithmen passen sich dabei an die zu verarbeitenden Daten an, was sie von anderen Verfahren unterscheidet. Assoziationsregelsuchen benötigen die Extraktion häufiger Kombinationen (EHK). Hierfür werden Möglichkeiten gezeigt, Lösungsansätze auf die Eigenschaften moderne System anzupassen. Als Ansatz werden Verfahren zur Berechnung der häufigsten $N$ Kombinationen erläutert, welche anders als bekannte Ansätze leicht konfigurierbar sind. Moderne Systeme rechnen zudem oft verteilt. Diese Rechnerverbünde können große Datenmengen parallel verarbeiten, benötigen jedoch die Vereinigung lokaler Ergebnisse. Für verteilte Top-N-EHK auf realistischen Partitionierungen werden hierfür Ansätze mit verschiedenen Eigenschaften präsentiert. Aus den häufigen Kombinationen werden Assoziationsregeln gebildet, deren Aufbereitung ebenfalls einfach durchführbar sein soll. In der Literatur wurden viele Maße vorgestellt. Je nach den Anforderungen entsprechen sie je einer subjektiven Bewertung, allerdings nicht zwingend der des Anwenders. Hierfür wird untersucht, wie mehrere Interessantheitsmaßen zu einem globalen Maß vereinigt werden können. Dies findet Regeln, welche mehrfach wichtig erschienen. Der Nutzer kann mit den Vorschlägen sein Suchziel eingrenzen. Ein zweiter Ansatz gruppiert Regeln. Dies erfolgt über die Häufigkeiten der Regelelemente, welche die Grundlage von Interessantheitsmaßen bilden. Die Regeln einer solchen Gruppe sind daher bezüglich vieler Interessantheitsmaßen ähnlich und können gemeinsam ausgewertet werden. Dies reduziert den manuellen Aufwand des Nutzers. Diese Arbeit zeigt Möglichkeiten, Assoziationsregelsuchen auf einen breiten Benutzerkreis zu erweitern und neue Anwender zu erreichen. Die Assoziationsregelsuche wird dabei derart vereinfacht, dass sie statt als Spezialanwendung als leicht nutzbares Werkzeug zur Datenanalyse verwendet werden kann. / The importance of data mining is widely acknowledged today. Mining for association rules and frequent patterns is a central activity in data mining. Three main strategies are available for such mining: APRIORI , FP-tree-based approaches like FP-GROWTH, and algorithms based on vertical data structures and depth-first mining strategies like ECLAT and CHARM. Unfortunately, most of these algorithms are only moderately suitable for many “real-world” scenarios because their usability and the special characteristics of the data are two aspects of practical association rule mining that require further work. All mining strategies for frequent patterns use a parameter called minimum support to define a minimum occurrence frequency for searched patterns. This parameter cuts down the number of patterns searched to improve the relevance of the results. 
In complex business scenarios, it can be difficult and expensive to define a suitable value for the minimum support because it depends strongly on the particular datasets. Users are often unable to set this parameter for unknown datasets, and unsuitable minimum-support values can extract millions of frequent patterns and generate enormous runtimes. For this reason, it is not feasible to permit ad-hoc data mining by unskilled users. Such users do not have the knowledge and time to define suitable parameters by trial-and-error procedures. Discussions with users of SAP software have revealed great interest in the results of association-rule mining techniques, but most of these users are unable or unwilling to set very technical parameters. Given such user constraints, several studies have addressed the problem of replacing the minimum-support parameter with more intuitive top-n strategies. We have developed an adaptive mining algorithm to give untrained SAP users a tool to analyze their data easily without the need for elaborate data preparation and parameter determination. Previously implemented approaches of distributed frequent-pattern mining were expensive and time-consuming tasks for specialists. In contrast, we propose a method to accelerate and simplify the mining process by using top-n strategies and relaxing some requirements on the results, such as completeness. Unlike such data approximation techniques as sampling, our algorithm always returns exact frequency counts. The only drawback is that the result set may fail to include some of the patterns up to a specific frequency threshold. Another aspect of real-world datasets is the fact that they are often partitioned for shared-nothing architectures, following business-specific parameters like location, fiscal year, or branch office. Users may also want to conduct mining operations spanning data from different partners, even if the local data from the respective partners cannot be integrated at a single location for data security reasons or due to their large volume. Almost every data mining solution is constrained by the need to hide complexity. As far as possible, the solution should offer a simple user interface that hides technical aspects like data distribution and data preparation. Given that BW Accelerator users have such simplicity and distribution requirements, we have developed an adaptive mining algorithm to give unskilled users a tool to analyze their data easily, without the need for complex data preparation or consolidation. For example, Business Intelligence scenarios often partition large data volumes by fiscal year to enable efficient optimizations for the data used in actual workloads. For most mining queries, more than one data partition is of interest, and therefore, distribution handling that leaves the data unaffected is necessary. The algorithms presented in this paper have been developed to work with data stored in SAP BW. A salient feature of SAP BW Accelerator is that it is implemented as a distributed landscape that sits on top of a large number of shared-nothing blade servers. Its main task is to execute OLAP queries that require fast aggregation of many millions of rows of data. Therefore, the distribution of data over the dedicated storage is optimized for such workloads. Data mining scenarios use the same data from storage, but reporting takes precedence over data mining, and hence, the data cannot be redistributed without massive costs. 
Distribution by special data semantics or user-defined selections can produce many partitions and very different partition sizes. The handling of such real-world distributions for frequent-pattern mining is an important task, but it conflicts with the requirement of balanced partitions.
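As a much-simplified illustration of merging local results from shared-nothing partitions into a global top-N (single items only; the thesis handles full itemsets and bounds what each partition must report), consider the following sketch; all names and data are assumed.

```python
# Much-simplified sketch: global top-N frequent items over partitioned data.
# Each partition counts locally (exact counts), a coordinator merges the
# counters and keeps the N most frequent. The thesis's algorithms handle full
# itemsets and bound what each partition must report; this sketch does not.
from collections import Counter
import heapq

def local_counts(partition):
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def global_top_n(partitions, n):
    merged = Counter()
    for part in partitions:               # in reality: one worker per partition
        merged.update(local_counts(part))
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])

partitions = [
    [["a", "b"], ["a", "c"], ["a"]],      # e.g. fiscal year 1
    [["b", "c"], ["c"], ["a", "c"]],      # e.g. fiscal year 2
]
print(global_top_n(partitions, n=2))      # 'a' and 'c', each with count 4
```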
90

資料採礦於資訊流通業(B2B)之應用研究—以個案公司為例 / The Application of Data Mining in the Information Distribution Industry (B2B): A Case Study

陳炳輝, Chen, Ping-Hui Unknown Date (has links)
Data mining refers to the automatic selection, by computer, of important and potentially useful patterns or knowledge from large amounts of data or large databases. The techniques it covers are now widely applied in many fields. This study uses data mining techniques to extract knowledge about the associations between customers and products from a large volume of customer transaction data and applies it to future sales activities.
In the distribution industry, data mining has mostly been applied to B2C settings; this study instead applies it to B2B transaction analysis, using the actual sales data between a case company and its customers as the data source. Clementine is used as the data mining tool, and, depending on the purpose of each analysis, its mining modules are applied to the case company's transaction data as follows:
* Using the association web, find the strong and weak relationships among product sales in the case data, pick out product combinations with high sales association, and use the C5.0 decision tree algorithm to characterize the customers involved in those transactions.
* Using the Apriori algorithm, find association rules over all products for different customer types, such as BZ (business districts), DL (dealers), and SP (retail stores), over different data periods.
* Using the Apriori algorithm on the first half-year of data, find purchase rules for product categories such as IFAKMB (motherboards), IFDDLC (LCD monitors), and IFCOCP (CPUs), and validate these rules against the second half-year of data to examine their feasibility.
The mining results are then interpreted against the case company's actual situation and, more importantly, examined for their applicability to sales practice, for example in product sales rules, marketing strategies, and promotion tactics. Finally, based on the results and experience of this study, recommendations are made to the case company on strengthening the data in its information management system and on directions in which data mining could be further applied.
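A small sketch of the validation step described in the last bullet above: rules assumed to have been mined on the first half-year are re-evaluated on second-half transactions to see whether their support and confidence hold up. The rule format, the use of category codes as items, and the toy data are assumptions.

```python
# Hypothetical sketch of validating mined rules on held-out second-half data.
def evaluate_rule(transactions, antecedent, consequent):
    ant, cons = set(antecedent), set(consequent)
    n_ant = sum(ant <= t for t in transactions)
    n_both = sum((ant | cons) <= t for t in transactions)
    support = n_both / len(transactions)
    confidence = n_both / n_ant if n_ant else 0.0
    return support, confidence

# Rules assumed to come from mining the first half-year (illustrative only).
mined_rules = [({"IFAKMB"}, {"IFCOCP"}),      # motherboard -> CPU
               ({"IFDDLC"}, {"IFAKMB"})]      # LCD monitor -> motherboard

second_half = [{"IFAKMB", "IFCOCP"}, {"IFAKMB", "IFCOCP", "IFDDLC"},
               {"IFDDLC"}, {"IFCOCP"}]

for ant, cons in mined_rules:
    sup, conf = evaluate_rule(second_half, ant, cons)
    print(f"{sorted(ant)} -> {sorted(cons)}: support={sup:.2f}, confidence={conf:.2f}")
```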
