61 |
Measuring academic performance of students in Higher Education using data mining techniques. Alsuwaiket, Mohammed. January 2018.
Educational Data Mining (EDM) is a developing discipline, concerned with expanding the classical Data Mining (DM) methods and developing new methods for discovering the data that originate from educational systems. It aims to use those methods to achieve a logical understanding of students, and the educational environment they should have for better learning. These data are characterized by their large size and randomness, which can make it difficult for educators to extract knowledge from them. Additionally, knowledge extracted from data by means of counting the occurrence of certain events is not always reliable, since the counting process sometimes does not take into consideration other factors and parameters that could affect the extracted knowledge. Student attendance in Higher Education has always been dealt with in a classical way, i.e. educators rely on counting the occurrence of attendance or absence, building their knowledge about students as well as modules on this count. This method is neither credible nor does it necessarily provide a real indication of a student's performance. On the other hand, the choice of an effective student assessment method is an issue of interest in Higher Education. Various studies (Romero, et al., 2010) have shown that students tend to get higher marks when assessed through coursework-based assessment methods - which include either modules that are fully assessed through coursework or a mixture of coursework and examinations - than when assessed by examination alone. A large number of Educational Data Mining (EDM) studies have pre-processed data through the conventional Data Mining processes, including the data preparation process, but they use transcript data as it stands, without looking at the weighting of examination and coursework results, which could affect prediction accuracy. This thesis explores the above problems and tries to formulate the extracted knowledge in a way that guarantees accurate and credible results. Student attendance data, gathered from the educational system, were first cleaned in order to remove any randomness and noise, then various attributes were studied so as to highlight the most significant ones that affect the real attendance of students. The next step was to derive an equation that measures the Student Attendance's Credibility (SAC) considering the attributes chosen in the previous step. The reliability of the newly developed measure was then evaluated in order to examine its consistency. In terms of transcript data, this thesis proposes a different data preparation process, investigating more than 230,000 student records in order to prepare students' marks based on the assessment methods of the enrolled modules. The data have been processed through different stages in order to extract a categorical factor through which students' module marks are refined during the data preparation process. The results of this work show that students' final marks should not be isolated from the nature of the enrolled modules' assessment methods; rather, they must be investigated thoroughly and considered during EDM's data pre-processing phases. More generally, it is concluded that educational data should not be prepared in the same way as existing data, owing to differences such as the sources of the data, their applications, and the types of errors in them.
Therefore, an attribute, the Coursework Assessment Ratio (CAR), is proposed for use in order to take the different modules' assessment methods into account while preparing student transcript data. The effect of CAR and SAC on the prediction process using data mining classification techniques such as Random Forest, Artificial Neural Networks and k-Nearest Neighbours has been investigated. The results were generated by applying the DM techniques to our data set and evaluated by measuring the statistical differences between the Classification Accuracy (CA) and Root Mean Square Error (RMSE) of all models. A comprehensive evaluation has been carried out for all results in the experiments to compare the results of all DM techniques, and it has been found that Random Forest (RF) has the highest CA and lowest RMSE. The importance of SAC and CAR in increasing the prediction accuracy is demonstrated in Chapter 5. Finally, the results have been compared with previous studies that predicted students' final marks based on their marks at earlier stages of their study. The comparisons have taken into consideration similar data and attributes, first excluding average CAR and SAC and then including them, and measuring the prediction accuracy in both cases. The aim of this comparison is to ensure that the new preparation process stage positively affects the final results.
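A minimal sketch of how the two proposed attributes could enter a prediction pipeline of the kind described above. The column names, the SAC blend shown here, and the target label are illustrative assumptions - the thesis derives its own SAC equation and CAR definition - and standard scikit-learn classifiers merely stand in for the Random Forest / ANN / k-NN comparison.

```python
# Illustrative sketch only: column names, the SAC blend and the target label
# are assumptions, not the exact definitions used in the thesis.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def add_car(df: pd.DataFrame) -> pd.DataFrame:
    """Coursework Assessment Ratio: share of a module's mark that comes from coursework."""
    df["CAR"] = df["coursework_weight"] / (df["coursework_weight"] + df["exam_weight"])
    return df


def add_sac(df: pd.DataFrame, w_attend: float = 0.7, w_engage: float = 0.3) -> pd.DataFrame:
    """Stand-in for Student Attendance Credibility: blends raw attendance with an
    engagement signal so bare head-counts are not taken at face value."""
    df["SAC"] = w_attend * df["attendance_rate"] + w_engage * df["vle_activity_rate"]
    return df


def compare_models(df: pd.DataFrame, features: list, target: str = "final_band") -> dict:
    """Fit the three classifier families compared in the thesis and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df[target], test_size=0.3, random_state=42)
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
    }
    return {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```

Running compare_models once with and once without "CAR" and "SAC" in the feature list mirrors, in spirit, the before/after comparison the abstract describes.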
|
62 |
Blog content mining: topic identification and evolution extraction. January 2009.
Ng, Kuan Kit. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves 92-100). Abstract also in Chinese.
Contents: Abstract (p.i); Acknowledgement (p.iii).
Chapter 1 Introduction (p.1): 1.1 Blog Overview (p.2); 1.2 Motivation (p.4); 1.2.1 Blog Mining (p.5); 1.2.2 Topic Detection and Tracking (p.8); 1.3 Objectives and Contributions (p.9); 1.4 Proposed Methodology (p.11).
Chapter 2 Related Work (p.13): 2.1 Web Document Clustering (p.13); 2.2 Document Clustering with Temporal Information (p.15); 2.3 Blog Mining (p.17).
Chapter 3 Feature Extraction and Selection (p.20): 3.1 Blog Extraction and Content Cleaning (p.21); 3.1.1 Blog Parsing and Structure Identification (p.22); 3.1.2 Stop-word Removal (p.24); 3.1.3 Word Stemming (p.25); 3.1.4 Heuristic Content Cleaning and Multiword Grouping (p.25); 3.2 Feature Selection (p.26); 3.2.1 Term Frequency Inverse Document Frequency (p.27); 3.2.2 Term Contribution (p.29).
Chapter 4 Blog Topic Extraction (p.31): 4.1 Requirements of Document Clustering (p.32); 4.1.1 Vector Space Modeling (p.32); 4.1.2 Similarity Measurement (p.33); 4.2 Document Clustering (p.34); 4.2.1 Partitional Clustering (p.36); 4.2.2 Hierarchical Clustering (p.37); 4.2.3 Density-Based Clustering (p.38); 4.3 Proposed Concept Clustering (p.40); 4.3.1 Semantic Distance between Concepts (p.43); 4.3.2 Bounded Density-Based Clustering (p.47); 4.3.3 Document Assignment with Topic Clusters (p.57); 4.4 Discussion (p.58).
Chapter 5 Blog Topic Evolution (p.61): 5.1 Topic Evolution Graph (p.61); 5.2 Topic Evolution (p.64).
Chapter 6 Experimental Result (p.69): 6.1 Evaluation of Topic Cluster (p.70); 6.1.1 Evaluation Criteria (p.70); 6.1.2 Evaluation Result (p.73); 6.2 Evaluation of Topic Evolution (p.79); 6.2.1 Results of Topic Evolution Graph (p.80); 6.2.2 Evaluation Criteria (p.82); 6.2.3 Evaluation of Topic Evolution (p.83); 6.2.4 Case Study (p.84).
Chapter 7 Conclusions and Future Work (p.88): 7.1 Conclusions (p.88); 7.2 Future Work (p.90).
Bibliography (p.92). Appendices: A Stop Word List (p.101); B Feature Selection Comparison (p.104); C Topic Evolution (p.106); D Topic Cluster (p.108).
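The contents above outline a pipeline of TF-IDF feature selection, vector-space modelling, similarity measurement and density-based clustering. The fragment below is a rough sketch of that general kind of pipeline using off-the-shelf components; the DBSCAN parameters are assumptions, and this is not the bounded density-based concept clustering proposed in the thesis.

```python
# Generic TF-IDF + cosine-distance + density-based clustering sketch; an
# assumption-laden illustration, not the thesis's bounded algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN


def cluster_blog_posts(posts, eps=0.6, min_samples=3):
    """Group blog posts into topic clusters; label -1 marks noise posts."""
    vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    X = vectorizer.fit_transform(posts)              # vector space model
    # cosine distance keeps the similarity measure independent of post length
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(X)
    return labels
```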
|
63 |
Automatic web resource compilation using data mining. Escudeiro, Nuno Filipe Fonseca Vasconcelos. January 2004.
Master's thesis in Data Analysis and Decision Support Systems (Análise de Dados e Sistemas de Apoio à Decisão). Faculdade de Economia, Universidade do Porto, 2004.
|
64 |
Embedding constraints into association rules mining. Kutty, Sangeetha. Unknown date.
Mining frequent patterns from large databases plays a vital role in many data mining tasks and has a broad range of applications. Most previously proposed algorithms have been specifically designed for one type of dataset, making them unsuitable for a range of datasets. A few techniques have been suggested to improve the performance of these association rule mining algorithms. However, these algorithms do not support a high level of user interaction, relying only on the classic support and confidence metrics for expressing user requirements. On the other hand, techniques exist that focus on improving the level of user interaction at the cost of performance. In this work, we propose a new algorithm, FOLD-growth with Constraints (FGC), which not only provides user interaction but also improves performance over existing popular algorithms. It embeds the user-defined constraints into a pre-processing structure to generate constraint-satisfying itemsets and uses this result to build a highly compact data structure. Interestingly, the constraint embedding technique makes existing pattern growth methods not only efficient but also highly effective over a range of datasets, irrespective of their data distribution. The technique also supports the use of conjunctions of different types of commonly used constraints.
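The abstract describes embedding user-defined constraints into the mining process itself rather than filtering results afterwards. The sketch below illustrates that general idea with a plain Apriori-style search that prunes candidates failing an (assumed anti-monotone) constraint as early as possible; it is not a reconstruction of FGC or its pre-processing structure.

```python
# Sketch of constraint pushing in frequent-itemset mining. Pruning inside the
# search is only safe for anti-monotone constraints (if a set fails, so do all
# of its supersets), which is the assumption made here.
from itertools import combinations


def constrained_frequent_itemsets(transactions, min_support, constraint=lambda s: True):
    """Return {itemset: support} for itemsets that are frequent and satisfy the constraint."""
    tx = [set(t) for t in transactions]
    n = len(tx)
    items = sorted({i for t in tx for i in t})
    level = [frozenset([i]) for i in items if constraint({i})]   # prune at level 1
    result = {}
    while level:
        counts = {c: sum(c <= t for t in tx) for c in level}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        # join step: combine frequent k-itemsets into (k+1)-candidates,
        # discarding any candidate that already violates the constraint
        keys = sorted(frequent, key=sorted)
        level = list({a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1 and constraint(a | b)})
    return result
```

For example, constrained_frequent_itemsets(baskets, 0.05, constraint=lambda s: s <= {"milk", "bread", "butter"}) confines the search to a user-chosen item family instead of mining everything and filtering at the end.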
|
65 |
Data mining algorithms for genomic analysis. Ao, Sio-iong. January 2007.
Thesis (Ph.D.)--University of Hong Kong, 2007. Title proper from title frame. Also available in printed format.
|
66 |
A general framework for mining spatial and spatio-temporal object association patterns in scientific data. Yang, Hui. January 2006.
Thesis (Ph.D.)--Ohio State University, 2006. Title from first page of PDF file. Includes bibliographical references (p. 143-158).
|
67 |
Realizing a feature-based framework for scientific data mining. Mehta, Sameep. January 2006.
Thesis (Ph.D.)--Ohio State University, 2006. Title from first page of PDF file. Includes bibliographical references (p. 167-176).
|
68 |
Discovering and summarizing email conversations. Zhou, Xiaodong. 05 1900.
With the ever-increasing popularity of email, it is very common nowadays for people to discuss specific issues, events or tasks within a group by email. Those discussions can be viewed as conversations via email and are valuable to the user as a personal information repository. For instance, ten minutes before a meeting, a user may want to quickly go through a previous discussion via email that is about to be discussed in that meeting. In this case, rather than reading each individual email one by one, it is preferable to read a concise summary of the previous discussion with the major information summarized. In this thesis, we study the problem of discovering and summarizing email conversations. We believe that our work can greatly help users manage their email folders. However, the characteristics of email conversations, e.g., lack of synchronization, conversational structure and informal writing style, make this task particularly challenging. In this thesis, we tackle this task by considering the following aspects: discovering the emails in one conversation, capturing the conversation structure and summarizing the email conversation. We first study how to discover all emails belonging to one conversation. Specifically, we study the hidden email problem, which is important for email summarization and other applications but has not been studied before. We propose a framework to discover and regenerate hidden emails. The empirical evaluation shows that this framework is accurate and scalable to large folders. Second, we build a fragment quotation graph to capture email conversations. The hidden emails belonging to each conversation are also included in the corresponding graph. Based on the quotation graph, we develop a novel email conversation summarizer, ClueWordSummarizer. The comparison with a state-of-the-art email summarizer as well as with a popular multi-document summarizer shows that ClueWordSummarizer obtains higher accuracy in most cases. Furthermore, to address the characteristics of email conversations, we study several ways to improve ClueWordSummarizer by considering more lexical features. The experiments show that many of those improvements, especially the subjective words and phrases, can significantly increase the accuracy.
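As a toy illustration of the clue-word intuition mentioned above - words that reappear in the quoted parent fragment tend to mark the important sentences of a reply - the sketch below ranks a reply's sentences by that overlap. The tokenization and scoring are simplifying assumptions, and this is not the ClueWordSummarizer or the fragment quotation graph itself.

```python
# Toy clue-word scorer: the real system builds a fragment quotation graph over
# the whole conversation; here we only score one reply against one quoted parent.
import re


def _tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def clue_word_summary(reply_body, quoted_parent, max_sentences=2):
    """Return the reply sentences that reuse the most words from the quoted parent."""
    clue_words = _tokens(quoted_parent)
    sentences = re.split(r"(?<=[.!?])\s+", reply_body.strip())
    ranked = sorted(sentences, key=lambda s: len(_tokens(s) & clue_words), reverse=True)
    return ranked[:max_sentences]
```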
|
69 |
Automatically Extract Information from Web Documents. Sharma, Dipesh. 01 December 2007.
The Internet could be considered a reservoir of useful information in textual form - product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases, displayed in Web pages with fixed templates. Mining data records in Web pages is useful because they typically present their host pages' essential information, such as lists of products and services. Extracting these structured data objects enables one to integrate data and information from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of Web content and the economic benefits associated with it. However, due to the heterogeneity of Web pages, automated discovery of targeted information still poses a challenging problem.
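The abstract's key observation is that template-generated records share a repeated tag structure, which is what makes automatic (rather than hand-coded) extraction possible. The sketch below illustrates only that observation: it groups text nodes by their tag path and keeps the paths that repeat, under the assumption that heavily repeated paths correspond to data records; it is not the extraction method proposed in the thesis.

```python
# Minimal illustration of template-based record detection: text fragments that
# share one frequently repeated tag path are treated as candidate record fields.
from collections import defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record the tag path of every non-empty text node in the page."""
    VOID = {"br", "hr", "img", "input", "link", "meta"}   # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_path["/".join(self.stack)].append(text)


def repeated_records(html, min_repeats=3):
    """Return {tag_path: [texts]} for paths repeated often enough to look like records."""
    collector = PathCollector()
    collector.feed(html)
    return {path: texts for path, texts in collector.by_path.items()
            if len(texts) >= min_repeats}
```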
|
70 |
Privacy-preserving data mining. Zhang, Nan. 15 May 2009.
In the research of privacy-preserving data mining, we address issues related to extracting knowledge from large amounts of data without violating the privacy of the data owners. In this study, we first introduce an integrated baseline architecture, design principles, and implementation techniques for privacy-preserving data mining systems. We then discuss the key components of privacy-preserving data mining systems, which include three protocols: data collection, inference control, and information sharing. We present and compare strategies for realizing these protocols. Theoretical analysis and experimental evaluation show that our protocols can generate accurate data mining models while protecting the privacy of the data being mined.
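The dissertation's three protocols are only named here, not specified, so the following is a generic, hedged illustration of the data-collection side of the problem using classical randomized response: each data owner perturbs a sensitive bit before handing it over, and the miner can still recover the aggregate frequency. The mechanism and parameters are assumptions, not the protocols proposed in the dissertation.

```python
# Classical randomized response as a stand-in for a privacy-preserving data
# collection step: individual reports are noisy, aggregates stay usable.
import random


def randomize(true_bit, p_truth=0.7):
    """Report the true bit with probability p_truth, otherwise a fair coin flip."""
    return true_bit if random.random() < p_truth else random.random() < 0.5


def estimate_rate(reports, p_truth=0.7):
    """Invert the randomization: observed = p*true + (1-p)*0.5, solved for true."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


if __name__ == "__main__":
    population = [random.random() < 0.3 for _ in range(10_000)]   # 30% true rate
    reports = [randomize(bit) for bit in population]
    print(round(estimate_rate(reports), 3))   # close to 0.30, yet no single bit is revealed
```

With p_truth = 0.7 and a few thousand respondents, the estimate typically lands within a couple of percentage points of the true rate while each individual retains plausible deniability.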
|