Global ETD Search

1	以規則為基礎的分類演算法：應用粗糙集 / A Rule-Based classification algorithm: a rough set approach
廖家奇 (Liao, Chia Chi), Unknown Date

In this thesis, we propose a rule-based classification algorithm named ROUSER (ROUgh SEt Rule), which uses rough set theory as the basis of its search heuristics during rule generation. We implement ROUSER using a well-developed and widely used toolkit, evaluate it on several public data sets, and examine its applicability in a real-world case study.

The problem addressed in this thesis originates from a real-world case in which the goal is to determine whether a data record collected from a sensor corresponds to a machine fault. To assist in the root cause analysis of machine faults, we design and implement a rule-based classification algorithm that generates models consisting of human-understandable decision rules connecting symptoms to causes. Moreover, the data contain contradictions. For example, two data records collected at different points in time may be similar, or even identical except for their timestamps, yet one corresponds to a machine fault and the other does not. The challenge is to analyze data with such contradictions. We use rough set theory to overcome this challenge, since it is able to process imperfect knowledge.

Researchers have proposed various classification algorithms, and practitioners have applied them in many domains, but most of these algorithms are designed without a focus on the interpretability or understandability of the models they build. ROUSER is specifically designed to extract human-understandable decision rules from nominal data. What distinguishes ROUSER from most, if not all, other rule-based classification algorithms is that it uses a rough set approach to select features. ROUSER also provides several ways to choose an appropriate attribute-value pair for the antecedent of a rule. Moreover, its rule generation method is based on the separate-and-conquer strategy, and hence it is more efficient than the indiscernibility matrix method widely adopted in classification algorithms based on rough set theory.

We conduct extensive experiments to evaluate the capability of ROUSER. On about half of the nominal data sets considered, ROUSER achieves accuracy comparable to or better than that of classification algorithms that generate decision rules or trees. On some of the discretized data sets, it also achieves comparable or better accuracy. We also present experimental results comparing the embedded feature selection method with other methods, as well as results for the several ways of choosing an attribute-value pair for the antecedent of a rule.

Keywords: data mining; classification; rough set; rule learning; separate-and-conquer; rule induction
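
The contradiction described in this abstract (records with identical attribute values but different fault labels) is the situation that rough set lower and upper approximations are designed to describe. The following is a minimal sketch of those two approximations only, not of ROUSER itself. The function name `approximations`, the record layout, and the sensor values are all assumptions made for illustration.

```python
from collections import defaultdict


def approximations(records, target):
    """Return (lower, upper) approximations of the records labelled `target`.

    records: list of (attribute_tuple, label) pairs; list indices identify records.
    """
    # Group record indices into indiscernibility classes:
    # records with identical attribute values cannot be told apart.
    classes = defaultdict(set)
    for idx, (attrs, _label) in enumerate(records):
        classes[attrs].add(idx)

    target_set = {i for i, (_attrs, label) in enumerate(records) if label == target}

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target_set:   # every member carries the target label
            lower |= members
        if members & target_set:    # at least one member carries the target label
            upper |= members
    return lower, upper


# Hypothetical sensor records with timestamps already dropped (made-up values).
records = [
    (("high", "on"), "fault"),   # 0
    (("high", "on"), "ok"),      # 1: same attributes as record 0, different label
    (("low", "on"), "ok"),       # 2
    (("high", "off"), "fault"),  # 3
]

lower, upper = approximations(records, "fault")
boundary = upper - lower  # records whose fault status is uncertain
```

Here record 3 is certainly a fault (lower approximation), while the contradictory records 0 and 1 fall in the boundary region, so a rule learner can treat them as uncertain rather than as noise.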
2	The research on chinese text multi-label classification / Avancée en classification multi-labels de textes en langue chinoise / 中文文本多标签分类研究
Wei, Zhihua, 07 May 2010

Text classification (TC), an important field in information technology, has many valuable applications. As information resources grow, the objects of TC are becoming more complicated and diverse, and research into effective, practical TC technology is challenging. More and more researchers consider multi-label TC better suited to many applications. This thesis analyses the difficulties and problems in multi-label TC and Chinese text representation, based on a large number of algorithms for single-label and multi-label TC. Targeting the high dimensionality of the feature space, the sparse distribution in text representation, and the poor performance of multi-label classifiers, it proposes corresponding algorithms from several angles.

To address the "curse of dimensionality" that arises when Chinese texts are represented with n-grams, a two-step feature selection algorithm is constructed. The method combines filtering rare features within a class with selecting discriminative features across classes. In addition, the proper value of n, the feature weighting strategy, and the correlation among features are discussed on the basis of a variety of experiments, yielding some useful conclusions for research on n-gram representation of Chinese texts.

To address a drawback of the Latent Dirichlet Allocation (LDA) model, namely that the smoothing value is set arbitrarily, a new smoothing strategy based on the Tolerance Rough Set (TRS) is put forward. It first constructs tolerance classes over the global vocabulary and then assigns a value to each out-of-vocabulary (OOV) word in each class according to its tolerance class.

To improve the performance of the multi-label classifier and reduce computational complexity, a new TC method based on the LDA model is applied to Chinese text representation. It extracts topics statistically from texts, and the texts are then represented by topic vectors. It shows competitive performance on both English and Chinese corpora.

To enhance classifier performance in multi-label TC, a compound classification framework is proposed. It partitions the text space by computing upper and lower approximations, decomposing a multi-label TC problem into several single-label TC problems and several multi-label TC problems with fewer labels than the original problem. That is, an unknown text is classified by a single-label classifier when it falls into the lower approximation of some class; otherwise, it is classified by the corresponding multi-label classifier.

An application system, TJ-MLWC (Tongji Multi-label Web Classifier), was designed. It calls search engine results directly and classifies them in real time using an improved Naïve Bayes classifier. This makes browsing more convenient: users can immediately locate the texts of interest according to the class information given by TJ-MLWC.

Abstract (from the French): The thesis centres on text classification, a rapidly expanding field with many current and potential applications. Its main contributions concern two points. The first is the specific characteristics of encoding and automatically processing the Chinese language: words may consist of one, two, or three characters; there is no typographic separation between words; and a sentence admits many possible word orders, all of which leads to difficult ambiguity problems. Encoding with n-grams (sequences of n = 1, 2, or 3 characters) is particularly well suited to Chinese, because it is fast and requires neither a prior dictionary-based word recognition step nor word segmentation. The second is multi-label classification, in which each individual can be assigned to one or more classes. For texts, the classes sought correspond to themes (topics), and a single text may be attached to one or several themes. The multi-label approach is more general: a single patient may suffer from several pathologies; a single company may be active in several industrial or service sectors. The thesis analyses these problems and attempts to solve them, first for single-label classifiers and then for multi-label ones. The difficulties include defining the variables that characterise the texts, their large number, the handling of sparse tables (many zeros in the text-by-descriptor matrix), and the relatively poor performance of the usual multi-class classifiers.
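
Abstract (from the Chinese): Text classification is an important research field in information science with considerable practical value. As the content handled by text classification becomes increasingly complex and diverse, and classification goals become more varied, developing effective text classification techniques that meet practical needs has become a challenging task, and research on multi-label classification has emerged in response. Based on an analysis and study of a large number of single-label and multi-label text classification algorithms, this thesis addresses the high dimensionality of features and data sparsity in text representation, and the high complexity and low precision of multi-label classification. It attempts to solve these problems from different angles using rough set theory and proposes corresponding algorithms, mainly the following.

For the curse of dimensionality caused by using n-grams as Chinese text features, a two-step feature selection method is proposed that combines removal of rare features within a class with feature selection across classes. Extensive experiments on a large-scale Chinese corpus examine the choice of n, the choice of feature weights, and feature correlation, and yield some useful conclusions.

For the low efficiency and high cost of classification with high-dimensional text representations, a multi-label text classification algorithm based on the LDA model is proposed, which uses the topics extracted by LDA as text features to build an efficient classifier. Under the PT3 multi-label transformation method, the algorithm performs well on both Chinese and English data sets, on par with the multi-label classification methods currently recognised as the best.

For the arbitrariness of existing smoothing strategies for the LDA model, a smoothing strategy based on tolerance rough sets is proposed. It first constructs tolerance classes of words over the global vocabulary and then assigns a smoothing value to the out-of-vocabulary words of each document class according to the frequencies of the words in the tolerance class. Extensive experiments on Chinese and English, balanced and imbalanced corpora show that this smoothing method significantly improves the classification performance of the LDA model, with the improvement especially pronounced on imbalanced corpora.

For the high complexity and low precision of multi-label classification, a compound multi-label text classification framework based on variable precision rough sets is proposed. The framework partitions the text feature space with the variable precision rough set method, decomposing the multi-label problem into several two-class single-label problems and several multi-label problems with fewer labels. That is, when an unknown text falls in the lower approximation region of a class, a simple single-label classifier can determine its class directly; when it falls in the boundary region, the multi-label classifier for that region is used. Experiments show that both precision and efficiency improve considerably under this framework.

The thesis also designs and implements MLWC, a visualisation system for web search results based on multi-label classification. The system directly retrieves the results returned by a search engine and classifies them in real time with an improved Naïve Bayes multi-label classification algorithm, so that users can quickly locate the texts of interest among the search results.

Keywords: Chinese text classification; Chinese text encoding; text representation; multi-label classification; n-grams; rough set; Latent Dirichlet Allocation (LDA); classification method; classifier design; smoothing model; Tongji multi-label web classification system; Chinese multi-label text corpus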
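
The two-step feature selection over character n-grams mentioned in this entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's algorithm: the abstract does not say how discriminative features are scored, so the `spread` score below (the difference between the highest and lowest per-class document-frequency rates) is a stand-in, and the names `char_ngrams`, `two_step_select`, `min_df`, and `top_k` as well as the sample texts are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n_values=(1, 2)):
    """All character n-grams of the given lengths; no word segmentation needed."""
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]


def two_step_select(docs, min_df=2, top_k=8):
    """docs: list of (text, label) pairs. Returns up to top_k selected n-grams."""
    # Document frequency of each n-gram within each class.
    df = defaultdict(Counter)
    class_size = Counter()
    for text, label in docs:
        class_size[label] += 1
        df[label].update(set(char_ngrams(text)))

    # Step 1: drop n-grams that are rare within every class.
    kept = {g for counts in df.values() for g, c in counts.items() if c >= min_df}

    # Step 2 (assumed score): prefer n-grams whose document-frequency
    # rate differs most between classes.
    def spread(gram):
        rates = [df[label][gram] / class_size[label] for label in class_size]
        return max(rates) - min(rates)

    return sorted(kept, key=lambda g: (-spread(g), g))[:top_k]


# Made-up example: two finance texts and two sports texts.
docs = [
    ("股票的上涨", "finance"),  # "the rise of stocks"
    ("股票的下跌", "finance"),  # "the fall of stocks"
    ("球队的获胜", "sports"),   # "the team's win"
    ("球队的失利", "sports"),   # "the team's loss"
]
selected = two_step_select(docs)
```

In this toy corpus, step 1 discards n-grams that occur in only one document (such as 上涨), and step 2 ranks the particle 的 last because it is equally frequent in both classes, so it is the one n-gram cut by `top_k`.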
