51

Improving RDF data with data mining

Abedjan, Ziawasch January 2014 (has links)
Linked Open Data (LOD) comprises numerous, often very large public data sets and knowledge bases. These datasets are mostly represented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications. Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation: a user-based approach, where a user selects the entity to be amended with new facts, and a data-driven approach, where an algorithm discovers entities that have to be amended with missing facts. As knowledge bases constantly grow and evolve, another way to improve the usage of RDF data is to improve existing ontologies. Here, we present an association-rule-based approach to reconcile ontology and data. Interlacing different mining configurations, we derive an algorithm to discover synonymously used predicates. These predicates can be used to expand query results and to support users during query formulation. We provide a wide range of experiments on real-world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology.
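To make the idea of a mining configuration concrete, the following sketch treats each RDF subject as a transaction whose items are its predicates and derives simple predicate co-occurrence rules, the kind of rule that can drive predicate auto-completion. The triples, thresholds and variable names are illustrative assumptions, not taken from the thesis.

```python
from itertools import combinations
from collections import defaultdict

# Toy RDF triples: (subject, predicate, object).
triples = [
    ("Berlin", "type", "City"), ("Berlin", "country", "Germany"),
    ("Berlin", "population", "3600000"),
    ("Paris", "type", "City"), ("Paris", "country", "France"),
    ("Paris", "population", "2100000"),
    ("Goethe", "type", "Person"), ("Goethe", "birthPlace", "Frankfurt"),
]

# One "mining configuration": context = subject, items = predicates.
transactions = defaultdict(set)
for s, p, o in triples:
    transactions[s].add(p)

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

# Frequent predicate pairs and rules p -> q, usable for predicate suggestion.
min_support, min_confidence = 0.5, 0.8
predicates = {p for t in transactions.values() for p in t}
for p, q in combinations(sorted(predicates), 2):
    pair_sup = support({p, q}, transactions)
    if pair_sup >= min_support:
        conf = pair_sup / support({p}, transactions)
        if conf >= min_confidence:
            print(f"{p} -> {q}  (support={pair_sup:.2f}, confidence={conf:.2f})")
```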
52

一個基於記憶體內運算之多維度多顆粒度資料探勘之研究-以yahoo user profile為例 / A Research of Multi-dimensional and Multi-granular Data Mining with In-memory Computing with Yahoo User Profile

林洸儂, Lin, Guang-Nung Unknown Date (has links)
In recent years, advances in cloud computing and improvements in computer hardware have made it feasible to build cluster computing systems by horizontally scaling out across large numbers of machines. Apache Hadoop is an open-source software framework of the Apache Foundation; it implements Google's MapReduce and the Google File System as a distributed system that can manage clusters of thousands of machines. Its distributed file system, HDFS, provides petabyte-scale storage, and the MapReduce framework splits an application into small tasks that are distributed across the compute nodes of the cluster for execution. At the same time, enterprises have accumulated enormous volumes of data, and processing and analyzing such structured and unstructured data has become a popular research topic. Traditional data mining methods and algorithms must therefore be adjusted and improved to fit cloud computing techniques and the concepts of distributed frameworks. Association rules capture the relationships hidden among the items of a large database; market basket analysis is a common application. Rules are usually mined within a fixed dimension and a fixed level of granularity, but this can miss rules that only hold in finer-grained subsets: mining a whole year of transactions, for example, will not reveal rules among the items consumers buy to celebrate Christmas, whereas restricting the data to December will. Apriori is a well-known algorithm for mining association rules. It generates candidate itemsets, filters them by a user-defined minimum support to obtain frequent itemsets, and then filters the resulting rules by minimum confidence. With k distinct items there can be up to 2^k - 1 candidate itemsets, and counting frequent itemsets requires repeated scans of the entire database, so these two main steps of Apriori consume a considerable amount of computing power. This research therefore partitions the database into multiple segments, mines association rules within each segment, and incrementally merges the results, which solves the problem of losing rules when mining only at a coarse granularity. The method is implemented on the Spark distributed computing framework, and parallel execution on a cluster reduces the time needed to mine the association rules.
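The partition-and-merge idea can be sketched as follows: itemsets are counted within each fine-grained partition (here, months), locally frequent itemsets are kept as candidates, and the candidates are recounted once over the full database. The data, thresholds and the brute-force counting routine are illustrative assumptions; the thesis uses Apriori on Spark rather than this in-memory enumeration.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count itemsets up to max_size and keep those meeting min_support (a ratio)."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for items in combinations(sorted(t), k):
                counts[items] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

# The whole "year" of transactions, split into finer-grained partitions (by month).
partitions = {
    "november": [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"}],
    "december": [{"tree", "lights"}, {"tree", "lights", "gift"}, {"tree", "gift"}],
}

# Local pass: itemsets frequent within at least one partition survive,
# even if they would be infrequent over the whole year (the Christmas effect).
local_candidates = set()
for name, txns in partitions.items():
    local = frequent_itemsets(txns, min_support=0.6)
    local_candidates |= set(local)
    print(name, sorted(local))

# Global pass: recount the surviving candidates once over the full database.
all_txns = [t for txns in partitions.values() for t in txns]
global_support = {
    items: sum(set(items) <= t for t in all_txns) / len(all_txns)
    for items in local_candidates
}
print({items: round(s, 2) for items, s in global_support.items()})
```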
53

Association Rule Based Classification

Palanisamy, Senthil Kumar 03 May 2006 (has links)
In this thesis, we focused on the construction of classification models based on association rules. Although association rules have been predominantly used for data exploration and description, the interest in using them for prediction has rapidly increased in the data mining community. In order to mine only rules that can be used for classification, we modified the well known association rule mining algorithm Apriori to handle user-defined input constraints. We considered constraints that require the presence/absence of particular items, or that limit the number of items, in the antecedents and/or the consequents of the rules. We developed a characterization of those itemsets that will potentially form rules that satisfy the given constraints. This characterization allows us to prune, during itemset construction, those itemsets for which neither they nor any of their supersets will form valid rules. This improves the time performance of itemset construction. Using this characterization, we implemented a classification system based on association rules and compared the performance of several model construction methods, including CBA, and several model deployment modes to make predictions. Although the data mining community has dealt only with the classification of single-valued attributes, there are several domains in which the classification target is set-valued. Hence, we enhanced our classification system with a novel approach to handle the prediction of set-valued class attributes. Since the traditional classification accuracy measure is inappropriate in this context, we developed an evaluation method for set-valued classification based on the E-Measure. Furthermore, we enhanced our algorithm by not relying on the typical support/confidence framework, and instead mining for the best possible rules above a user-defined minimum confidence and within a desired range for the number of rules. This avoids long mining times that might produce large collections of rules with low predictive power. For this purpose, we developed a heuristic function to determine an initial minimum support and then adjusted it using a binary search strategy until a number of rules within the given range was obtained. We implemented all of our techniques described above in WEKA, an open source suite of machine learning algorithms. We used several datasets from the UCI Machine Learning Repository to test and evaluate our techniques.
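The support-tuning step described at the end of the abstract can be sketched as a binary search over the minimum support: if the miner returns too many rules the threshold is raised, if too few it is lowered, until the rule count falls inside the requested range. The toy miner, data and thresholds below are assumptions for illustration and not the thesis' heuristic initialisation.

```python
from itertools import combinations

def count_rules(transactions, min_support, min_confidence):
    """Tiny stand-in miner: count rules a -> b with one item on each side."""
    n = len(transactions)
    sup = lambda items: sum(set(items) <= t for t in transactions) / n
    rules = 0
    for a, b in combinations(sorted({i for t in transactions for i in t}), 2):
        s = sup((a, b))
        if s >= min_support:
            rules += s / sup((a,)) >= min_confidence
            rules += s / sup((b,)) >= min_confidence
    return rules

def tune_min_support(transactions, min_confidence, target_range, max_iter=20):
    """Binary-search the minimum support until the rule count lands in target_range."""
    lo, hi = 0.0, 1.0
    low, high = target_range
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        n_rules = count_rules(transactions, mid, min_confidence)
        if low <= n_rules <= high:
            break
        if n_rules > high:
            lo = mid          # too many rules: raise the support threshold
        else:
            hi = mid          # too few rules: lower the support threshold
    return mid, n_rules

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(tune_min_support(transactions, min_confidence=0.7, target_range=(5, 8)))
```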
54

Suporte a sistemas de auxílio ao diagnóstico e de recuperação de imagens por conteúdo usando mineração de regras de associação / Supporting Computer-Aided Diagnosis and Content-Based Image Retrieval Systems through Association Rule Mining

Ribeiro, Marcela Xavier 16 December 2008 (has links)
In this work, we take advantage of association rule mining to support two types of medical systems: Content-Based Image Retrieval (CBIR) systems and Computer-Aided Diagnosis (CAD) systems. For content-based retrieval, association rules are employed to reduce the dimensionality of the feature vectors that represent the images and to diminish the semantic gap that exists between low-level features and their high-level semantic meaning. The StARMiner (Statistical Association Rule Miner) algorithm was developed to associate low-level features with their semantic meaning. StARMiner is also employed to perform feature selection in medical image datasets, improving the precision of CBIR systems. To improve CAD systems, we developed the IDEA (Image Diagnosis Enhancement through Association rules) method. Association rules are employed to suggest a second opinion or a preliminary diagnosis of a new image to the radiologist. A second opinion obtained automatically can accelerate the process of diagnosing or strengthen a hypothesis, giving the physician statistical support for the decision-making process. Two new algorithms were developed to support the IDEA method: one to pre-process the low-level features and one to propose a diagnosis based on association rules. We performed several experiments to validate the developed methods. The results indicate that association rules can be successfully applied to improve CBIR and CAD systems, empowering the arsenal of techniques to support medical image analysis in medical systems.
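As a rough, hypothetical illustration of statistical, class-conditional feature selection of the StARMiner kind, the sketch below keeps only features whose mean within one class clearly departs from the rest of the data; the effect-size criterion, data and threshold are assumptions, not the algorithm's actual statistical tests.

```python
import numpy as np

def select_features(X, y, min_effect=1.0):
    """Keep features whose class-conditional mean differs markedly from the rest
    for at least one class (a rough stand-in for a statistical association rule
    of the form 'class -> feature is high/low')."""
    selected = []
    for j in range(X.shape[1]):
        col = X[:, j]
        for c in np.unique(y):
            inside, outside = col[y == c], col[y != c]
            pooled_std = np.sqrt((inside.var() + outside.var()) / 2) + 1e-9
            if abs(inside.mean() - outside.mean()) / pooled_std >= min_effect:
                selected.append(j)
                break
    return selected

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
informative = np.where(y == 0, 0.2, 0.8) + rng.normal(0, 0.05, 100)  # tracks the class
noise = rng.normal(0.5, 0.2, (100, 3))                               # does not
X = np.column_stack([informative, noise])
print(select_features(X, y))   # expected: [0]
```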
55

Rule Mining and Sequential Pattern Based Predictive Modeling with EMR Data

Abar, Orhan 01 January 2019 (has links)
Electronic medical record (EMR) data is collected on a daily basis at hospitals and other healthcare facilities to track patients’ health situations including conditions, treatments (medications, procedures), diagnostics (labs) and associated healthcare operations. Besides being useful for individual patient care and hospital operations (e.g., billing, triaging), EMRs can also be exploited for secondary data analyses to glean discriminative patterns that hold across patient cohorts for different phenotypes. These patterns in turn can yield high level insights into disease progression with interventional potential. In this dissertation, using a large scale realistic EMR dataset of over one million patients visiting University of Kentucky healthcare facilities, we explore data mining and machine learning methods for association rule (AR) mining and predictive modeling with mood and anxiety disorders as use-cases. Our first work involves analysis of existing quantitative measures of rule interestingness to assess how they align with a practicing psychiatrist’s sense of novelty/surprise corresponding to ARs identified from EMRs. Our second effort involves mining causal ARs with depression and anxiety disorders as target conditions through matching methods accounting for computationally identified confounding attributes. Our final effort involves efficient implementation (via GPUs) and application of contrast pattern mining to predictive modeling for mental conditions using various representational methods and recurrent neural networks. Overall, we demonstrate the effectiveness of rule mining methods in secondary analyses of EMR data for identifying causal associations and building predictive models for diseases.
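For reference, the standard quantitative interestingness measures that such an analysis compares (support, confidence, lift, conviction) can be computed directly from co-occurrence counts; the patient counts below are hypothetical.

```python
def interestingness(n, n_a, n_b, n_ab):
    """Standard measures for a rule A -> B from co-occurrence counts:
    n transactions overall, n_a containing A, n_b containing B, n_ab containing both."""
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    conviction = float("inf") if confidence == 1 else (1 - n_b / n) / (1 - confidence)
    return {"support": support, "confidence": confidence,
            "lift": lift, "conviction": conviction}

# Hypothetical cohort: 10,000 patients, 2,000 on a given medication,
# 1,200 with an anxiety diagnosis, 900 with both.
print(interestingness(n=10_000, n_a=2_000, n_b=1_200, n_ab=900))
```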
56

Data Mining Using Neural Networks

Rahman, Sardar Muhammad Monzurur, mrahman99@yahoo.com January 2006 (has links)
Data mining is about the search for relationships and global patterns in large databases that are ever increasing in size. Data mining is beneficial for anyone who has a huge amount of data, for example, customer and business data, transaction, marketing, financial, manufacturing and web data. The results of data mining are also referred to as knowledge in the form of rules, regularities and constraints. Rule mining is one of the popular data mining methods, since rules provide concise statements of potentially important information that are easily understood by end users, as well as actionable patterns. Rule mining has received a good deal of attention and enthusiasm from data mining researchers, since it is capable of solving many data mining problems such as classification, association, customer profiling, summarization, segmentation and many others. This thesis makes several contributions by proposing rule mining methods using genetic algorithms and neural networks. The thesis first proposes rule mining methods using a genetic algorithm. These methods are based on an integrated framework but are capable of mining three major classes of rules. Moreover, the rule mining process in these methods is controlled by tuning two data mining measures, support and confidence. The thesis shows how to build data mining predictive models using the rules produced by the proposed methods. Another key contribution of the thesis is the proposal of rule mining methods using supervised neural networks. The thesis mathematically analyses the Widrow-Hoff learning algorithm of a single-layered neural network, which provides a foundation for rule mining algorithms using single-layered neural networks. Three rule mining algorithms using single-layered neural networks are proposed for the three major classes of rules on the basis of the proposed theorems. The thesis also looks at the problem of rule mining where user guidance is absent and proposes a guided rule mining system to overcome this problem. The thesis extends this work further by comparing the performance of the algorithm used in the proposed guided rule mining system with the Apriori data mining algorithm. Finally, the thesis studies the Kohonen self-organization map as an unsupervised neural network for rule mining algorithms. Two approaches are adopted, based on how self-organization maps are applied in rule mining models. In the first approach, the self-organization map is used for clustering, which provides class information to the rule mining process. In the second approach, automated rule mining takes the place of trained neurons as the map grows in a hierarchical structure.
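The Widrow-Hoff (LMS) rule analysed in the thesis updates the weights of a single linear neuron against the gradient of the squared error after each example, w <- w + eta (d - w.x) x. A minimal sketch with toy data follows; the learning rate and data are illustrative assumptions, and the rule-extraction step built on top of this analysis is not shown.

```python
import numpy as np

def widrow_hoff(X, d, eta=0.05, epochs=50):
    """Widrow-Hoff (LMS) training of a single linear neuron: after each example,
    the weights move against the gradient of the squared error (d - w.x)^2."""
    X = np.column_stack([X, np.ones(len(X))])   # append a bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            error = target - w @ x
            w += eta * error * x
    return w

# Toy data: the target is a noisy linear function of two inputs.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
d = 0.7 * X[:, 0] - 0.3 * X[:, 1] + 0.1 + rng.normal(0, 0.01, 200)
print(widrow_hoff(X, d))   # should approach [0.7, -0.3, 0.1]
```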
57

Fuzzy-Granular Based Data Mining for Effective Decision Support in Biomedical Applications

He, Yuanchen 04 December 2006 (has links)
Due to the complexity of biomedical problems, adaptive and intelligent knowledge discovery and data mining systems are highly needed to help humans understand the inherent mechanisms of diseases. For biomedical classification problems, it is typically impossible to build a perfect classifier with 100% prediction accuracy. Hence a more realistic target is to build an effective Decision Support System (DSS). In this dissertation, a novel adaptive Fuzzy Association Rules (FARs) mining algorithm, named FARM-DS, is proposed to build such a DSS for binary classification problems in the biomedical domain. Empirical studies show that FARM-DS is competitive with state-of-the-art classifiers in terms of prediction accuracy. More importantly, FARs can provide strong decision support for disease diagnoses due to their easy interpretability. This dissertation also proposes a fuzzy-granular method to select informative and discriminative genes from huge microarray gene expression datasets. With fuzzy granulation, information loss in the process of gene selection is decreased. As a result, more informative genes for cancer classification are selected and more accurate classifiers can be modeled. Empirical studies show that the proposed method is more accurate than traditional algorithms for cancer classification, and hence we expect that the selected genes can be more helpful for further biological studies.
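A minimal sketch of the fuzzy granulation step, assuming simple triangular membership functions: a numeric value (for example, a gene expression level) is mapped to graded membership in low/medium/high granules instead of a single crisp bin, which is where the reduced information loss comes from. The breakpoints and granule names are illustrative assumptions.

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b over the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(value, lo, hi):
    """Map a numeric value to degrees of membership in three fuzzy granules.
    Unlike crisp binning, a value near a boundary belongs partly to both sides."""
    mid = (lo + hi) / 2
    return {
        "low": triangular(value, lo - (mid - lo), lo, mid),
        "medium": triangular(value, lo, mid, hi),
        "high": triangular(value, mid, hi, hi + (hi - mid)),
    }

# An expression value near the low/medium boundary keeps partial membership
# in both granules instead of being forced into a single bin.
print(fuzzify(0.35, lo=0.0, hi=1.0))   # {'low': 0.3, 'medium': 0.7, 'high': 0.0}
```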
58

Design of Comprehensible Learning Machine Systems for Protein Structure Prediction

Hu, Hae-Jin 06 August 2007 (has links)
Many computational approaches have recently been developed in the effort to understand protein structure. Among them, Support Vector Machine (SVM) methods have been applied and have shown successful performance compared with other machine learning schemes. However, despite the high performance, the SVM approaches suffer from a lack of understandability because the SVM is a black-box model; the predictions made by an SVM cannot be interpreted in a biologically meaningful way. To overcome this limitation, a new association-rule-based classifier, PCPAR, was devised based on the existing classifier CPAR to handle sequential data. The performance of PCPAR was further improved by designing the following two hybrid schemes. The PCPAR/SVM method is a parallel combination of the PCPAR and the SVM, and the PCPAR_SVM method is a sequential combination of the PCPAR and the SVM. To understand the SVM prediction, the SVM_PCPAR scheme was developed. The experimental results show that the PCPAR scheme performs better than the CPAR method with respect to accuracy and the number of generated patterns. The PCPAR/SVM scheme performs better than the PCPAR, PCPAR_SVM or SVM_PCPAR schemes and almost equally to the SVM. The generated patterns are easily understandable and biologically meaningful. The robustness evaluation and the ROC curve analysis proved that this new scheme is robust and competent.
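As a generic, hypothetical illustration of a sequential rule/SVM hybrid (in the spirit of PCPAR_SVM, but not the PCPAR algorithm itself): a confident rule answers when it covers the example, and the SVM is the fallback otherwise. The toy data and the hand-written rule are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

class RuleThenSVM:
    """One way to combine a rule list with an SVM sequentially: if a confident
    rule covers the example, use its class; otherwise fall back to the SVM."""

    def __init__(self, rules, min_confidence=0.9):
        self.rules = rules                  # list of (predicate, class, confidence)
        self.min_confidence = min_confidence
        self.svm = SVC()

    def fit(self, X, y):
        self.svm.fit(X, y)
        return self

    def predict(self, X):
        out = []
        for x in X:
            for predicate, label, conf in self.rules:
                if conf >= self.min_confidence and predicate(x):
                    out.append(label)
                    break
            else:
                out.append(self.svm.predict([x])[0])
        return np.array(out)

# Toy two-feature data with a hand-written rule covering part of the space.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
rules = [(lambda x: x[0] > 0.9, 1, 0.95)]   # "feature 0 very high -> class 1"
model = RuleThenSVM(rules).fit(X, y)
print((model.predict(X) == y).mean())
```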
59

Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries

Chen, Bernard 17 July 2009 (has links)
Protein sequence motifs are attracting more and more attention in the field of sequence analysis. These recurring patterns have the potential to determine the conformation, function and activities of proteins. In our work, we obtained protein sequence motifs that are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovery algorithms, our input dataset is extremely large, and an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), to efficiently generate protein motif information. We then develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motif searching process, setting a fixed window size in advance may reduce the computational complexity and increase efficiency. However, due to the fixed size, our model may deliver a number of similar motifs that are simply shifted by some residues or include mismatches. We develop a new strategy named Positional Association Super-Rule to confront the problem of motifs generated from a fixed window size. It combines super-rule analysis with a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) via a modified HHK clustering, which requires no parameter setup, to identify the similarities and dissimilarities between the motifs. The positional association rules are created and applied to search for similar motifs that are shifted by some residues. By analyzing the motif results generated by our approaches, we find that these motifs are significant not only at the sequence level, but also in secondary structure similarity and biochemical properties.
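A minimal sketch of fixed-window motif extraction and of the shifted-duplicate problem the Positional Association Super-Rule addresses: identical windows are counted across sequences, and frequent windows that are shifted copies of one another are detected. The sequences, window size and support cut-off are illustrative assumptions.

```python
from collections import Counter

def windows(sequence, size):
    """All fixed-size windows (candidate motif instances) from one sequence."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

sequences = ["MKVLAAGIV", "AKVLAAGIT", "MKVLAAGIV"]
counts = Counter(w for seq in sequences for w in windows(seq, size=5))
frequent = {w: c for w, c in counts.items() if c >= 2}
print(frequent)

# A fixed window produces shifted near-duplicates such as 'KVLAA' and 'VLAAG';
# grouping these is what the positional association / super-rule step is for.
def shifted_by_one(a, b):
    return a[1:] == b[:-1] or b[1:] == a[:-1]

motifs = sorted(frequent)
pairs = [(a, b) for i, a in enumerate(motifs) for b in motifs[i + 1:] if shifted_by_one(a, b)]
print(pairs)
```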
60

A data mining approach to ontology learning for automatic content-related question-answering in MOOCs

Shatnawi, Safwan January 2016 (has links)
The advent of Massive Open Online Courses (MOOCs) allows massive numbers of registrants to enrol in these courses. This research aims to offer MOOC registrants automatic content-related feedback to fulfil their cognitive needs. A framework is proposed which consists of three modules: the subject ontology learning module, the short text classification module, and the question answering module. Unlike previous research, a regular expression parser approach is used to identify relevant concepts for ontology learning, and the relevant concepts are extracted from unstructured documents. To build the concept hierarchy, a frequent pattern mining approach is used, guided by a heuristic function to ensure that sibling concepts are at the same level in the hierarchy. As this process does not require specific lexical or syntactic information, it can be applied to any subject. To validate the approach, the resulting ontology is used in a question-answering system which analyses students' content-related questions and generates answers for them. Textbook end-of-chapter questions and answers are used to validate the question-answering system. The resulting ontology is compared against the use of Text2Onto for the question-answering system, and it achieves favourable results. Finally, different indexing approaches based on a subject's ontology are investigated when classifying short text in MOOC forum discussion data; the investigated indexing approaches are unigram-based, concept-based and hierarchical concept indexing. The experimental results show that the ontology-based feature indexing approaches outperform the unigram-based indexing approach. Experiments are done in binary classification and multi-label classification settings. The results are consistent and show that hierarchical concept indexing outperforms both concept-based and unigram-based indexing. The bagging and random forest classifiers achieved the best results among the tested classifiers.
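A minimal sketch of a regular-expression approach to candidate concept extraction, assuming a simple pattern over runs of capitalised words and a recurrence filter; the actual patterns and relevance criteria used in the thesis are not specified in the abstract.

```python
import re
from collections import Counter

# One possible regular-expression "parser" for candidate concepts: runs of
# capitalised words, optionally joined by 'of'/'for' (illustrative only).
CONCEPT_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+)(?:\s+(?:of|for)\s+[A-Z][a-z]+|\s+[A-Z][a-z]+)*\b"
)

text = """In this course we study Linear Regression and Logistic Regression.
Linear Regression is introduced before Gradient Descent. Overfitting is
controlled with Regularization, and Gradient Descent is revisited later."""

candidates = Counter(m.group(0) for m in CONCEPT_PATTERN.finditer(text))
# Keep candidates that recur, a crude relevance filter for concept extraction.
concepts = [c for c, n in candidates.items() if n >= 2]
print(concepts)   # expected: ['Linear Regression', 'Gradient Descent']
```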
