311
Automatic software testing via mining software data. / 基於軟件數據挖掘的自動軟件測試 / CUHK electronic theses & dissertations collection / Ji yu ruan jian shu ju wa jue de zi dong ruan jian ce shi / January 2011 (has links)
Zheng, Wujie. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 128-141). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
312
Modeling and optimization of industrial systems with data mining / He, Xiaofei, 01 May 2014 (has links)
The energy efficiency of industrial systems is of widespread concern, and modeling and optimization of industrial systems has been an active research area aimed at improving their energy efficiency. Traditional analytical and physics-based methods reported in the literature are of limited use for modeling industrial systems, which are complex, nonlinear, and dynamic.
Owing to progress in data collection techniques, large volumes of data have been collected and stored for analysis. Although such data contain much valuable information, their use in modeling industrial systems is lagging. Data mining provides a platform and techniques for modeling complex systems and processes, and data mining techniques have been widely applied to modeling various systems.
In this thesis, two energy-intensive industrial systems are investigated: a pump system in wastewater treatment plants and an HVAC system in commercial buildings. Data mining is used to derive models describing the relationship between the target, the operational cost of the system, and the system control variables. An optimization model is constructed to minimize the operational cost of a system, and intelligent algorithms are employed to solve it. The study demonstrates considerable energy savings from applying the proposed control strategy.
The approach developed in this thesis can be applied to industrial systems other than the pump and HVAC systems.
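As an illustration of the kind of pipeline described in this abstract, the sketch below mines a cost model from synthetic operational data and then searches the control space for low-cost settings. The random forest model, the random-search stand-in for an "intelligent algorithm", and all variable names are assumptions made for the example, not the models used in the thesis.

```python
# Sketch of the data-driven modeling + optimization loop described above.
# Assumptions (not from the thesis): synthetic data, a random forest as the
# data-mining model, and plain random search as the "intelligent algorithm".
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic history: two controllable variables (e.g., pump speeds) and one
# uncontrollable disturbance (e.g., inflow), with an unknown cost response.
n = 2000
controls = rng.uniform(0.0, 1.0, size=(n, 2))
inflow = rng.uniform(0.5, 1.5, size=(n, 1))
cost = (3.0 * (controls[:, 0] - 0.6 * inflow[:, 0]) ** 2
        + 2.0 * (controls[:, 1] - 0.3) ** 2
        + 0.1 * rng.normal(size=n))            # energy-cost proxy

# Step 1: mine a model of cost as a function of controls and disturbances.
X = np.hstack([controls, inflow])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, cost)

# Step 2: for the current inflow, search the control space for low predicted cost.
def optimize_controls(current_inflow, n_candidates=5000):
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, 2))
    features = np.hstack([candidates,
                          np.full((n_candidates, 1), current_inflow)])
    predicted = model.predict(features)
    best = np.argmin(predicted)
    return candidates[best], predicted[best]

u_best, c_best = optimize_controls(current_inflow=1.2)
print("suggested controls:", u_best, "predicted cost:", c_best)
```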
313
Random Relational Rules / Anderson, Grant, January 2008 (has links)
In the field of machine learning, methods for learning from single-table data have received much more attention than those for learning from multi-table, or relational data, which are generally more computationally complex. However, a significant amount of the world's data is relational. This indicates a need for algorithms that can operate efficiently on relational data and exploit the larger body of work produced in the area of single-table techniques. This thesis presents algorithms for learning from relational data that mitigate, to some extent, the complexity normally associated with such learning. All algorithms in this thesis are based on the generation of random relational rules. The assumption is that random rules enable efficient and effective relational learning, and this thesis presents evidence that this is indeed the case. To this end, a system for generating random relational rules is described, and algorithms using these rules are evaluated. These algorithms include direct classification, classification by propositionalisation, clustering, semi-supervised learning and generating random forests. The experimental results show that these algorithms perform competitively with previously published results for the datasets used, while often exhibiting lower runtime than other tested systems. This demonstrates that sufficient information for classification and clustering is retained in the rule generation process and that learning with random rules is efficient. Further applications of random rules are investigated. Propositionalisation allows single-table algorithms for classification and clustering to be applied to the resulting data, reducing the amount of relational processing required. Further results show that techniques for utilising additional unlabeled training data improve accuracy of classification in the semi-supervised setting. The thesis also develops a novel algorithm for building random forests by making efficient use of random rules to generate trees and leaves in parallel.
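A minimal sketch of the propositionalisation idea referred to above: randomly generated rules over a related table become Boolean features of the target examples, after which any single-table learner can be applied. The toy data, rule language, and choice of decision tree are assumptions made for illustration, not the system developed in the thesis.

```python
# Illustrative sketch of propositionalisation with random relational rules:
# each randomly generated rule over the related "purchases" table becomes one
# Boolean feature of the target "customer" examples.
import random
from sklearn.tree import DecisionTreeClassifier

random.seed(1)

# Toy relational data: each customer has a label and a bag of purchase tuples.
customers = []
for cid in range(200):
    purchases = [(random.choice(["food", "tech", "toys"]),
                  random.uniform(5, 200)) for _ in range(random.randint(1, 6))]
    # Ground-truth concept: buys expensive tech at least once.
    label = int(any(cat == "tech" and price > 120 for cat, price in purchases))
    customers.append((purchases, label))

def random_rule():
    """A random existential rule over the purchases relation."""
    cat = random.choice(["food", "tech", "toys"])
    threshold = random.uniform(20, 180)
    def rule(purchases):
        return any(c == cat and p > threshold for c, p in purchases)
    return rule

# Propositionalise: evaluate k random rules on every customer.
k = 50
rules = [random_rule() for _ in range(k)]
X = [[int(rule(purchases)) for rule in rules] for purchases, _ in customers]
y = [label for _, label in customers]

# Any single-table learner can now be applied to the Boolean feature table.
clf = DecisionTreeClassifier(random_state=0).fit(X[:150], y[:150])
print("holdout accuracy:", clf.score(X[150:], y[150:]))
```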
314
The Discovery of Interacting Episodes and Temporal Rule Determination in Sequential Pattern Mining / Mooney, Carl Howard, carl.mooney@bigpond.com, January 2007 (has links)
The purpose of data mining is to generate rules that can be used as the basis for making decisions. One such area is sequence mining, which, in terms of transactional datasets, can be stated as the discovery of inter-transaction associations, that is, associations between different transactions. The data used for sequence mining is not limited to data stored in overtly temporal or longitudinally maintained datasets, and in such domains the data can be viewed as a series of events, or episodes, occurring at specific times. The problem thus becomes a search for collections of events that frequently occur together.
While the mining of frequent episodes is an important capability, the manner in which such episodes interact can provide further useful knowledge in the search for a description of the behaviour of a phenomenon, but has as yet received little investigation. Moreover, while many sequences are associated with absolute time values, most sequence mining routines treat time in a relative sense, returning only patterns that can be described in terms of Allen-style relationships (or simpler), i.e., they say nothing about the relative pace of occurrence, and thus produce rules with more limited expressive power. Until now, temporal interval patterns have been based on the endpoints of the intervals; however, in many cases the natural point of reference is the midpoint of an interval, and it is therefore appropriate to develop a mechanism for reasoning between intervals when midpoint information is known.
This thesis presents a method for discovering interacting episodes from temporal sequences and for analysing them using temporal patterns. The mining can be conducted both with and without the mechanism for handling the pace of events, and the analysis is conducted using both the traditional interval algebras and a midpoint algebra presented in this thesis.
The visualisation of rules in data mining is a large and dynamic field in its own right, and although there has been a great deal of research into the visualisation of associations, there has been little in the area of sequence or episodic mining. Combined with the emerging field of mining stream data, this creates a need to pursue methods and structures for such visualisations, and this thesis therefore also contributes to research in this important area of visualisation.
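The contrast between endpoint-based (Allen-style) and midpoint-based views of two episode intervals can be made concrete with a small sketch. The relation names below follow Allen's interval algebra; the simple three-way midpoint comparison is only an illustrative stand-in for the midpoint algebra developed in the thesis, not its actual formulation.

```python
# Endpoint-based (Allen-style) versus midpoint-based comparison of intervals.

def allen_relation(a, b):
    """Return the Allen relation of interval a = (start, end) to b = (start, end)."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    if b1 < a1 < b2 < a2:
        return "overlapped-by"
    # Remaining cases are the inverses of before/meets.
    return "met-by" if a1 == b2 else "after"

def midpoint_relation(a, b):
    """Compare intervals by their midpoints instead of their endpoints."""
    mid_a, mid_b = sum(a) / 2.0, sum(b) / 2.0
    if mid_a < mid_b:
        return "midpoint-before"
    if mid_a > mid_b:
        return "midpoint-after"
    return "midpoint-aligned"

# Two episodes: the endpoint and midpoint views expose different information.
a, b = (0, 10), (4, 8)
print(allen_relation(a, b))      # contains
print(midpoint_relation(a, b))   # midpoint-before (5 < 6)
```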
315
Learning from large data: Bias, variance, sampling, and learning curves / Brain, Damien, mikewood@deakin.edu.au, January 2003 (has links)
One of the fundamental machine learning tasks is predictive classification. Given that organisations collect an ever-increasing amount of data, predictive classification methods must be able to handle large amounts of data both effectively and efficiently. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification algorithms were designed when data set sizes now common were beyond imagination.
This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered.
This dissertation investigates two of these key questions. The first is whether types of algorithms different from those currently employed are required when using large data sets. This is answered by analysing how the bias-plus-variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms from those currently used. Some insight into the characteristics of suitable algorithms is provided, which may give direction to the development of future classification algorithms designed specifically for use with large data sets.
The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used to avoid the need to scale up algorithms to suit the size of the data set, by scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of sub-sampling to reduce the time required to infer a model while sacrificing little, if any, accuracy.
Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step, and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can substantially reduce execution time with only a small loss, and occasional gain, in accuracy.
One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training set size that maximally reduces execution time while not being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, focusing on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed. It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt on the applicability of gradient-of-tangent methods for detecting convergence, and on the viability of learning curves for reducing execution time in general.
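The sketch below illustrates a learning curve built by progressive sampling together with the gradient-of-tangent convergence test discussed above. The learner, the synthetic data, and the gradient threshold are illustrative assumptions rather than the dissertation's experimental setup.

```python
# Learning curve by progressive sampling, with a gradient-based convergence test.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sizes = [250, 500, 1000, 2000, 4000, 8000, len(X_train)]
curve = []
for n in sizes:
    clf = GaussianNB().fit(X_train[:n], y_train[:n])
    curve.append(clf.score(X_test, y_test))

# Convergence test: stop when the slope between successive points is tiny.
# Local accuracy plateaus can trigger this early, which is the risk noted above.
threshold = 1e-6  # accuracy gain per additional training instance (assumed)
for (n0, a0), (n1, a1) in zip(zip(sizes, curve), zip(sizes[1:], curve[1:])):
    gradient = (a1 - a0) / (n1 - n0)
    print(f"n={n1:6d} accuracy={a1:.4f} gradient={gradient:.2e}")
    if abs(gradient) < threshold:
        print(f"learning curve judged converged at n={n1}")
        break
```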
316
Automated Information Extraction to Support Biomedical Decision Model Construction: A Preliminary Design / Li, Xiaoli; Leong, Tze Yun, 01 1900 (has links)
We propose an information extraction framework to support automated construction of decision models in biomedicine. Our proposed technique classifies text-based documents from a large biomedical literature repository, e.g., MEDLINE, into predefined categories, and identifies important keywords for each category based on their discriminative power. Relevant documents for each category are retrieved based on the keywords, and a classification algorithm is developed based on machine learning techniques to build the final classifier. We apply the HITS algorithm to select the authoritative and typical documents within a category, and construct templates in the form of Bayesian networks. Data mining and information extraction techniques are then applied to extract the necessary semantic knowledge to fill in the templates to construct the final decision models. / Singapore-MIT Alliance (SMA)
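A brief sketch of the HITS step mentioned above, applied to a tiny invented citation graph; in the proposed framework the nodes would be the documents retrieved for a category, linked for example by citations. The graph and scores here are purely illustrative.

```python
# Standard HITS by power iteration: authority and hub scores for documents.
import numpy as np

# doc_links[i][j] = 1 if document i links to (cites) document j (invented data).
doc_ids = ["d0", "d1", "d2", "d3", "d4"]
doc_links = np.array([
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
], dtype=float)

def hits(adjacency, n_iter=50):
    """Return (authority, hub) score vectors for the given adjacency matrix."""
    n = adjacency.shape[0]
    hubs = np.ones(n)
    authorities = np.ones(n)
    for _ in range(n_iter):
        authorities = adjacency.T @ hubs      # cited by good hubs -> authority
        authorities /= np.linalg.norm(authorities)
        hubs = adjacency @ authorities        # cites good authorities -> hub
        hubs /= np.linalg.norm(hubs)
    return authorities, hubs

auth, hub = hits(doc_links)
ranked = sorted(zip(doc_ids, auth), key=lambda pair: -pair[1])
print("most authoritative documents:", ranked[:2])
```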
317
An information criterion for use in predictive data mining / Kyper, Eric S., January 2006 (has links)
Thesis (Ph. D.)--University of Rhode Island, 2006. / Typescript. Includes bibliographical references (leaves 118-126).
318
A Web-based tool for analysis of crime laboratory data / Annabathula, Ramesh, January 2007 (has links)
Thesis (M.S.)--West Virginia University, 2007. / Title from document title page. Document formatted into pages; contains ix, 110 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 100-102).
319
Multimedia Data Mining and Retrieval for Multimedia Databases Using Associations and Correlations / Lin, Lin, 23 June 2010 (has links)
With the explosion in the complexity and amount of pervasive multimedia data, there is high demand for multimedia services and applications that allow people in various areas to easily access and distribute multimedia data. Faced with abundant multimedia resources but inefficient and rather old-fashioned keyword-based information retrieval approaches, a content-based multimedia information retrieval (CBMIR) system is required to (i) reduce the dimension space for storage savings and computation reduction; (ii) advance multimedia learning methods to accurately identify target semantics, bridging the gap between low-level/mid-level features and high-level semantics; and (iii) effectively search media content for dynamic media delivery and enable extensive media-type-driven applications. This research focuses on a multimedia data mining and retrieval system for multimedia databases, addressing main challenges such as data imbalance, data quality, the semantic gap, user subjectivity, and searching. A novel CBMIR system is therefore proposed in this dissertation. The proposed system utilizes both the association rule mining (ARM) technique and the multiple correspondence analysis (MCA) technique, taking into account both pattern discovery and statistical analysis. First, media content is represented by global and local low-level and mid-level features and stored in the multimedia database. Second, a data filtering component is proposed to improve data quality and reduce data imbalance; specifically, the filtering step vertically selects features and horizontally prunes instances in multimedia databases. Third, a new learning and classification method that mines weighted association rules is proposed in the retrieval system. The MCA-based correlation is used to generate and select the weighted N-feature-value pair rules, where N varies from one to many. Fourth, a ranking method independent of classifiers is proposed to sort the retrieved results and place the most interesting ones at the top of the browsing list. Finally, a user interface is implemented in the CBMIR system that allows the user to choose a concept of interest, searches media based on the target concept, ranks the retrieved segments using the proposed ranking algorithm, and then displays the top-ranked segments to the user. The system is evaluated with various high-level semantics from TRECVID benchmark data sets; the TRECVID sound and vision data is a large data set that includes various types of videos and has very rich semantics. Overall, the proposed system achieves promising results in comparison with other well-known methods. Moreover, experiments comparing each component with other well-known algorithms are conducted. The experimental results show that all proposed components improve the functionality of the CBMIR system, and that the proposed system achieves effectiveness, robustness, and efficiency for a high-dimensional multimedia database.
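To make the weighted feature-value-pair rule idea concrete, the sketch below scores items by summing the weights of the one-feature-value rules they satisfy and ranks the results. The simple co-occurrence-based weight is only a stand-in for the MCA-derived correlation used in the dissertation, and the data is synthetic.

```python
# Scoring and ranking with weighted one-feature-value pair rules (illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "shots": 3 low-level features, binary target concept.
n = 1000
features = rng.normal(size=(n, 3))
concept = (features[:, 0] + 0.5 * features[:, 1]
           + 0.3 * rng.normal(size=n) > 0).astype(int)

# Discretize each feature into 3 bins so rules are feature-value pairs.
bins = np.quantile(features, [1 / 3, 2 / 3], axis=0)
discrete = np.sum(features[:, :, None] > bins.T[None, :, :], axis=2)  # values 0..2

# Weight of each (feature, value) pair: how much more often it co-occurs with
# the concept than the base rate (positive = evidence for the concept).
base_rate = concept.mean()
weights = {}
for f in range(discrete.shape[1]):
    for v in range(3):
        mask = discrete[:, f] == v
        if mask.any():
            weights[(f, v)] = concept[mask].mean() - base_rate

def score(item_values):
    """Sum the weights of the one-feature-value rules the item satisfies."""
    return sum(weights.get((f, v), 0.0) for f, v in enumerate(item_values))

# Rank the items: highest scores go to the top of the browsing list.
scores = np.array([score(row) for row in discrete])
top = np.argsort(-scores)[:5]
print("top-ranked item indices:", top, "scores:", np.round(scores[top], 3))
```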
320
Temporal data mining in a dynamic feature space / Wenerstrom, Brent, January 2006 (has links) (PDF)
Thesis (M.S.)--Brigham Young University. Dept. of Computer Science, 2006. / Includes bibliographical references (p. 43-45).