1

Accurate and robust algorithms for microarray data classification

Hu, Hong January 2008 (has links)
[Abstract] Microarray data classification is primarily used to predict unseen data using a model built on categorized existing microarray data. One of the major challenges is that microarray data contains a large number of genes with a small number of samples. This high-dimensionality problem has prevented many existing classification methods from dealing directly with this type of data. Moreover, the small number of samples increases the overfitting problem of classification, leading to lower classification accuracy. Another major challenge is the uncertain quality of microarray data. Microarray data contains various, and often high, levels of noise, which lead to unreliable and low-accuracy analysis on top of the high-dimensionality problem. Most current classification methods are not robust enough to handle this type of data properly. Our research focuses on accuracy and noise resistance (robustness), and our approach is to design a robust classification method for microarray data. We propose an algorithm called diversified multiple decision trees (DMDT), which uses a set of unique trees in the decision committee. DMDT increases the diversity of the ensemble committee, and accuracy is therefore enhanced by avoiding overlapping genes among the alternative trees. We also examine strategies to eliminate noisy data. Because our method ensures that no genes overlap among the alternative trees in an ensemble committee, a noisy gene included in the committee can affect only one tree; the other trees in the committee are not affected at all. This design increases the robustness of microarray classification in terms of resistance to noisy data and therefore reduces the instability caused by overlapping genes in current ensemble methods. The effectiveness of gene selection methods for improving the performance of microarray classification is also discussed. We conclude that the proposed DMDT method substantially outperforms other well-known ensemble methods, such as Bagging, Boosting and Random Forests, in terms of accuracy and robustness. DMDT is more tolerant to noise than Cascading-and-Sharing trees (CS4), particularly with increasing levels of noise in the data. The results also indicate that some classification methods are insensitive to gene selection, while others depend on particular gene selection methods to improve their classification performance.
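A minimal sketch of the non-overlapping-genes idea behind DMDT (an illustration, not the author's implementation): each tree in the committee is built on a disjoint slice of a gene ranking, so a noisy gene can affect at most one tree. The function names, the use of scikit-learn's F-score ranking, and the majority-vote rule are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, f_classif

def train_diversified_committee(X, y, n_trees=5, genes_per_tree=20):
    # Rank genes once, then give each tree its own disjoint block of genes.
    ranked = np.argsort(SelectKBest(f_classif, k="all").fit(X, y).scores_)[::-1]
    committee = []
    for t in range(n_trees):
        genes = ranked[t * genes_per_tree:(t + 1) * genes_per_tree]  # no overlap between trees
        committee.append((genes, DecisionTreeClassifier().fit(X[:, genes], y)))
    return committee

def committee_predict(committee, X):
    # Majority vote over the trees; assumes binary 0/1 class labels.
    votes = np.array([tree.predict(X[:, genes]) for genes, tree in committee])
    return np.round(votes.mean(axis=0)).astype(int)
```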
2

Classification models for high-dimensional data with sparsity patterns

Tillander, Annika January 2013 (has links)
Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample setting are considered. There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive, though less so for discrete data. Hence, continuous features are often transformed into discrete features. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches. Further, the effect of discretization on misclassification probability in the high-dimensional setting is evaluated. Linear classifiers are more stable, which motivates adjusting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage estimation procedure for the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation gives a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework that represents sparse and weak block models is derived and a technique for block-wise feature selection is proposed. Probabilistic classifiers have the advantage of providing the probability of membership in each class for new observations rather than simply assigning them to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller tractable problems. The relevance and benefits of the proposed methods are illustrated using both simulated and real data. / With today's technology, for example spectrometers and gene chips, data are generated in large quantities. This abundance of data is not only an advantage but also causes certain problems; typically the number of variables (p) is considerably larger than the number of observations (n). This gives so-called high-dimensional data, which requires new statistical methods, since the traditional methods were developed for the opposite situation (p<n). Moreover, usually only very few of all these variables are relevant for any given project, and the strength of the information in the relevant variables is often weak. This type of data is therefore usually described as sparse and weak. Identifying the relevant variables is commonly likened to finding a needle in a haystack. This thesis considers three different ways of classifying this type of high-dimensional data, where classifying means using a data set containing both explanatory variables and an outcome variable to teach a function or algorithm how to predict the outcome variable from the explanatory variables alone. The real data used in the thesis are microarrays, cell samples that show the activity of the genes in the cell.
The goal of the classification is to use the variation in activity across thousands of genes (the explanatory variables) to determine whether a cell sample comes from cancer tissue or normal tissue (the outcome variable). There are classification methods that can handle high-dimensional data, but these are often computationally intensive and therefore often work better for discrete data. By transforming continuous variables to discrete ones (discretization), computation time can be reduced and the classification made more efficient. The thesis studies how discretization affects the prediction accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed. Linear classification methods have the advantage of being stable. Their drawback is that they require an invertible covariance matrix, which the covariance matrix of high-dimensional data is not. The thesis proposes a way to estimate the inverse of sparse covariance matrices with a block-diagonal matrix. This matrix also has the advantage of leading to additive classification, which makes it possible to select entire blocks of relevant variables. The thesis also presents a method for identifying and selecting these blocks. There are also probabilistic classification methods, which have the advantage of giving, for an observation, the probability of belonging to each of the possible outcomes, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes. The relevance and advantages of the proposed methods are demonstrated by applying them to simulated and real high-dimensional data.
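As a rough, hedged sketch of the ingredients mentioned for the second paper (a Lasso-type sparse precision estimate, Cuthill-McKee reordering, and an additive block-wise discriminant), the following shows one way those pieces could fit together; it is not the estimator proposed in the thesis, and the function names and block handling are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from sklearn.covariance import GraphicalLassoCV

def block_diagonal_order(X):
    # Sparse precision estimate (Lasso penalty), then reorder variables so the
    # nonzero pattern concentrates into near-diagonal blocks.
    precision = GraphicalLassoCV().fit(X).precision_
    support = csr_matrix((np.abs(precision) > 1e-8).astype(int))
    return reverse_cuthill_mckee(support)

def additive_lda_score(x, blocks, means0, means1, precisions):
    # Sum of per-block linear discriminant scores (an additive classifier).
    # blocks: list of index arrays; means0/means1/precisions: per-block class
    # means and precision matrices, assumed estimated elsewhere.
    score = 0.0
    for b, idx in enumerate(blocks):
        diff = means1[b] - means0[b]
        mid = (means1[b] + means0[b]) / 2.0
        score += diff @ precisions[b] @ (x[idx] - mid)
    return score  # positive favours class 1, negative favours class 0
```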
3

The Accuracy of Accuracy Estimates for Single Form Dichotomous Classification Exams

January 2013 (has links)
abstract: The use of exams for classification purposes has become prevalent across many fields, including professional assessment for employment screening and standards-based testing in educational settings. Classification exams assign individuals to performance groups based on the comparison of their observed test scores to a pre-selected criterion (e.g. masters vs. nonmasters in dichotomous classification scenarios). The successful use of exams for classification purposes assumes at least minimal levels of accuracy of these classifications. Classification accuracy is an index that reflects the rate of correct classification of individuals into the category that contains their true ability score. Traditional methods estimate classification accuracy via methods that assume true scores follow a four-parameter beta-binomial distribution. Recent research suggests that Item Response Theory may be a preferable alternative framework for estimating examinees' true scores and may return more accurate classifications based on these scores. Researchers hypothesized that test length, the location of the cut score, the distribution of items, and the distribution of examinee ability would impact the recovery of accurate estimates of classification accuracy. The current simulation study manipulated these factors to assess their potential influence on classification accuracy. Observed classification as masters vs. nonmasters, true classification accuracy, estimated classification accuracy, BIAS, and RMSE were analyzed. In addition, Analysis of Variance tests were conducted to determine whether an interrelationship existed between levels of the four manipulated factors. Results showed small values of estimated classification accuracy and increased BIAS in accuracy estimates with few items, mismatched distributions of item difficulty and examinee ability, and extreme cut scores. A significant four-way interaction between the manipulated variables was observed. In addition to interpretations of these findings and explanations of potential causes for the recovered values, recommendations that inform practice and avenues of future research are provided. / Dissertation/Thesis / M.A. Educational Psychology 2013
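To make the notion of classification accuracy concrete, here is a toy simulation in the spirit of the description above: the accuracy is the proportion of simulees whose observed pass/fail classification matches the classification implied by their true score. The Rasch-style data generation, test length, and cut score are assumptions for illustration, not the study's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items, cut = 1000, 40, 24      # assumed test length and cut score

theta = rng.normal(0, 1, n_examinees)         # true abilities
b = rng.normal(0, 1, n_items)                 # item difficulties
p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
responses = rng.uniform(size=p_correct.shape) < p_correct

true_scores = p_correct.sum(axis=1)           # expected number-correct (true score)
observed_scores = responses.sum(axis=1)

true_master = true_scores >= cut
observed_master = observed_scores >= cut
classification_accuracy = np.mean(true_master == observed_master)
print(f"classification accuracy: {classification_accuracy:.3f}")
```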
4

Detekce stresu a únavy v komplexních datech řidiče / Stress and fatigue detection in complex driver's data

Šimoňáková, Sabína January 2021 (has links)
The main aim of this thesis is fatigue and stress detection from the biological signals of a driver. The introduction surveys published detection methods and provides the theoretical background needed for the thesis. In the practical part, we first worked with a database of measured rides and selected their most relevant sections; feature extraction and selection followed. Five different classification models were used for tiredness and stress detection, with prediction based on real data. Finally, the last section compares the best model from this thesis with previously published results.
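A minimal sketch of the general pipeline described above: window a biological signal, extract simple features, and compare several classifiers with cross-validation. The feature set, window length, and the particular models are assumptions, not the configuration used in the thesis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def window_features(signal, fs, window_s=30):
    # Split the signal into non-overlapping windows and compute simple
    # time-domain features per window: mean, standard deviation, and RMS.
    step = int(fs * window_s)
    windows = [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]
    return np.array([[w.mean(), w.std(), np.sqrt(np.mean(w ** 2))] for w in windows])

def compare_models(X, y):
    # Cross-validated accuracy for a few candidate classifiers.
    models = {
        "svm": SVC(),
        "random_forest": RandomForestClassifier(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```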
5

Three case studies of using hybrid model machine learning techniques in Educational Data Mining to improve the classification accuracies

Poudyal, Sujan 09 August 2022 (has links) (PDF)
A multitude of data is being produced by the increase in instructional technology, e-learning resources, and online courses. This data could be used by educators to analyze and extract useful information that could be beneficial to both instructors and students. Educational Data Mining (EDM) extracts hidden information from data contained within the educational domain. In data mining, a hybrid method is a combination of multiple machine learning techniques. Through this dissertation, the novel use of hybrid machine learning techniques in EDM was explored through three educational case studies. First, in consideration of the importance of students' attention, on-task and off-task data were collected to analyze the attention behavior of students. Two feature selection techniques, Principal Component Analysis and Linear Discriminant Analysis, were combined to improve the classification accuracies for classifying students' attention patterns. The relationship between attention and learning was also studied by calculating Pearson's correlation coefficient and p-value. The examination then shifted towards academic performance, as it is important to ensuring a quality education. Two different 2D Convolutional Neural Network (CNN) models were concatenated to produce a single model for predicting students' academic performance in terms of pass and fail. Lastly, the importance of using machine learning in online learning to maintain academic integrity was considered. In this work, traditional machine learning algorithms were first used to predict cheaters in an online examination. A 1D CNN architecture was then used to extract features from the cheater dataset, and the previously used machine learning model was applied to the extracted features to detect cheaters. This type of hybrid model outperformed both the original traditional machine learning model and the CNN model used alone in terms of classification accuracy. The three studies reflect the use of machine learning applications in EDM. Classification accuracy is important in EDM because educational decisions are made based on the results of the models, so a hybrid method was employed to increase accuracy. Thus, this dissertation successfully shows that hybrid models can be used in EDM to improve classification accuracies.
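A hedged sketch of the hybrid idea in the third case study: a small 1D CNN serves as a feature extractor and a traditional classifier is trained on the extracted features. The layer sizes, the choice of an SVM, and the function names are assumptions; in practice the CNN would first be trained end-to-end before its features are reused.

```python
import numpy as np
from tensorflow import keras
from sklearn.svm import SVC

def build_feature_extractor(seq_len, n_channels, n_features=32):
    # Small 1D CNN whose output layer is treated as a feature vector.
    inputs = keras.Input(shape=(seq_len, n_channels))
    x = keras.layers.Conv1D(16, kernel_size=3, activation="relu")(inputs)
    x = keras.layers.MaxPooling1D(2)(x)
    x = keras.layers.Conv1D(32, kernel_size=3, activation="relu")(x)
    x = keras.layers.GlobalAveragePooling1D()(x)
    features = keras.layers.Dense(n_features, activation="relu")(x)
    return keras.Model(inputs, features)

def hybrid_fit_predict(X_train, y_train, X_test):
    extractor = build_feature_extractor(X_train.shape[1], X_train.shape[2])
    # Only the extract-then-classify step is shown here; training the CNN
    # itself (e.g. with a temporary classification head) is omitted.
    clf = SVC().fit(extractor.predict(X_train), y_train)
    return clf.predict(extractor.predict(X_test))
```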
6

Altering time compression algorithms of amplitude-integrated electroencephalography display improves neonatal seizure detection

Thomas, Cameron W. 11 October 2013 (has links)
No description available.
7

Item Parameter Drift as an Indication of Differential Opportunity to Learn: An Exploration of Item Flagging Methods & Accurate Classification of Examinees

Sukin, Tia M. 01 September 2010 (has links)
The presence of outlying anchor items is an issue faced by many testing agencies. The decision to retain or remove an item is a difficult one, especially when item removal calls the content representation of the anchor set into question. Additionally, the reason for the aberrancy is not always clear, and if the performance of the item has changed due to improvements in instruction, then removing the anchor item may not be appropriate and might produce misleading conclusions about the proficiency of the examinees. This study was conducted in two parts, a simulation and an empirical data analysis. In these studies, the effect on examinee classification was investigated when the decision was made to remove or retain aberrant anchor items. Three methods of detection were explored: (1) the delta plot, (2) IRT b-parameter plots, and (3) the RPU method. In the simulation study, the degree of aberrancy and the ability distribution of examinees were manipulated, and five aberrant item schemes were employed. In the empirical data analysis, archived statewide science achievement data suspected to exhibit differential opportunity to learn between administrations were re-analyzed using the various item parameter drift detection methods. The results of both the simulation and the empirical data study support eliminating flagged items from linking when a matrix-sampling design is used and the anchor contains a large number of items. While neither the delta plot nor the IRT b-parameter plot method produced results that would overwhelmingly support its use, it is recommended that both methods be employed in practice until further research is conducted on alternative methods such as the RPU method, since classification accuracy increases when such methods are employed and flagged items are removed, and growth is most often not misrepresented by doing so.
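To make the delta plot method concrete, the sketch below converts classical p-values from two administrations to the delta metric (delta = 13 + 4z) and flags items lying far from the principal axis of the scatter. The 1.5-unit flagging threshold follows common practice and is an assumption here, not a value taken from this study.

```python
import numpy as np
from scipy.stats import norm

def delta_plot_flags(p_year1, p_year2, threshold=1.5):
    # Delta transform of classical item difficulties (proportions correct).
    d1 = 13 + 4 * norm.ppf(1 - np.asarray(p_year1))
    d2 = 13 + 4 * norm.ppf(1 - np.asarray(p_year2))
    # Principal (major) axis of the (d1, d2) scatter from its covariance matrix.
    cov = np.cov(d1, d2)
    slope = (cov[1, 1] - cov[0, 0]
             + np.sqrt((cov[1, 1] - cov[0, 0]) ** 2 + 4 * cov[0, 1] ** 2)) / (2 * cov[0, 1])
    intercept = d2.mean() - slope * d1.mean()
    # Perpendicular distance of each item from the principal axis; large
    # distances suggest aberrant (drifting) anchor items.
    dist = np.abs(slope * d1 - d2 + intercept) / np.sqrt(slope ** 2 + 1)
    return dist > threshold
```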
8

Using Mathematics Curriculum Based Measurement as an Indicator of Student Performance on State Standards

Hall, Linda D. December 2009 (has links)
Math skills are essential to daily life, impacting a person's ability to function at home, at work, and in the community. Although reading has been the focus in recent years, many students struggle in math. The inability to master math calculation and problem solving has contributed to the rising incidence of student failure, referrals for special education evaluations, and dropout rates. Studies have shown that curriculum based measurement (CBM) is a well-established tool for formative assessment and could potentially be used for other purposes, such as predicting state standards test scores; however, to date there are few validity studies relating mathematics CBM to standards-based assessment. This research examined a brief assessment reported to be aligned to national curriculum standards in order to predict student performance on the state standards-based mathematics curriculum, identify students at risk of failure, and plan instruction. Evidence was gathered on the System to Enhance Educational Performance Grade 3 Focal Mathematics Assessment Instrument (STEEP3M) as a formative, universal screener. Using a sample of 337 students and 22 instructional staff, four qualities of the STEEP3M were examined: a) internal consistency and criterion-related (concurrent) validity; b) screening students for a multi-tiered decision-making process; c) utility for instructional planning and intervention recommendations; and d) efficiency of administration, scoring, and reporting results; these formed the basis of the four research questions for this study. Several optimized solutions were generated from Receiver Operating Characteristic (ROC) statistical analysis; however, none demonstrated that the STEEP3M maximized either sensitivity or specificity. In semi-structured interviews, teachers reported that they would consider using the STEEP3M, but only as part of a decision-making rubric along with other measures. Further, teachers indicated that lessons are developed before the school year starts, more in response to the sequence of the state standards than to students' needs. While the STEEP3M was sufficiently long for high-stakes or criterion-referenced decisions, this study found that the test does not provide sufficient diagnostic information for multi-tiered decision-making for intervention or instructional planning. Although practical and efficient to administer, the test does not provide sufficient information on the content domain and does not accurately classify students in need of assistance.
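A brief sketch of the kind of ROC cut-score analysis described above: for a screening score against a pass/fail criterion on the state test, compute sensitivity and specificity at each candidate cut and the area under the curve. The variable names are illustrative and not tied to the STEEP3M data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def screening_cut_scores(cbm_scores, failed_state_test):
    # "Positive" means at risk of failing the state test, so the CBM score is
    # negated: lower CBM performance maps to a higher risk score.  The
    # thresholds returned are therefore on the negated scale.
    risk = -np.asarray(cbm_scores)
    fpr, tpr, thresholds = roc_curve(failed_state_test, risk)
    sensitivity, specificity = tpr, 1 - fpr
    auc = roc_auc_score(failed_state_test, risk)
    return thresholds, sensitivity, specificity, auc
```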
9

Bonitní a bankrotní modely / Financial health models and bankruptcy prediction models

ONDOKOVÁ, Lucie January 2016 (has links)
The main aim of this master's thesis is to compare different methodologies of financial health models and bankruptcy prediction models and their effect on company classification. The work examines the applicability of the models on a sample of 45 prosperous companies and 45 companies that entered insolvency proceedings. The sample contains about 33 % companies from the building industry, 33 % from retail, 16.7 % from manufacturing and 16.7 % from other industries, mainly services. A special kind of contingency table, the confusion matrix, is used in the methodology to calculate sensitivity, specificity, negative predictive value, false positive rate, accuracy, error and other classification statistics. Overall model accuracy is obtained as the difference between accuracy and error. Dependencies between models are assessed using Pearson's correlation coefficient. Changes to the models (removing the grey zone and testing new cut-off points) are tested in a sensitivity analysis. In the practical part, 12 financial models are calculated (Altman Z', Altman Z'', Index IN99, IN01 and IN05, Kralicek Quicktest, Zmijewski model, Taffler model and its modification, Index of Creditworthiness, Grunwald Site Index, Doucha's Analysis). Only two financial indicators (ROA and Sales / Assets) turned out to be a crucial part of more than one model. The classifications of companies by the models are then determined. The results show that the best models according to overall accuracy are the Zmijewski model and Altman's Z''; the worst are index IN99 and both versions of Taffler's model. The classification is not excessively affected by extreme values, the year of the model's creation or the country of origin (hypothesis 1). Based on the results, it is suggested that bankruptcy prediction is an accurate forecaster of failure up to three years prior to bankruptcy in most of the examined models (hypothesis 2). It is observed that the type of model and the industry influence the classification. Finally, the changes based on the sensitivity analysis are applied to the worst-performing models; all three changes increased the overall classification accuracy of the models.
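A short sketch of the confusion-matrix statistics listed above, with overall accuracy computed as accuracy minus error, following the definition given in the thesis; the input encoding (True = bankrupt) is an assumption.

```python
import numpy as np

def model_statistics(predicted_bankrupt, actually_bankrupt):
    pred = np.asarray(predicted_bankrupt, dtype=bool)
    true = np.asarray(actually_bankrupt, dtype=bool)
    tp = np.sum(pred & true)      # bankrupt firms flagged as bankrupt
    tn = np.sum(~pred & ~true)    # healthy firms flagged as healthy
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    error = (fp + fn) / n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "negative_predictive_value": tn / (tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": accuracy,
        "error": error,
        "overall_accuracy": accuracy - error,  # accuracy minus error, as in the thesis
    }
```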
10

Recommendations Regarding Q-Matrix Design and Missing Data Treatment in the Main Effect Log-Linear Cognitive Diagnosis Model

Ma, Rui 11 December 2019 (has links)
Diagnostic classification models, used in conjunction with diagnostic assessments, are intended to classify individual respondents as masters or nonmasters at the level of attributes. Previous researchers (Madison & Bradshaw, 2015) recommended that items on the assessment measure all patterns of attribute combinations to ensure classification accuracy, but in practice certain attributes may not be measured by themselves. Moreover, model estimation requires a large sample size, but in reality there may be unanswered items in the data. Therefore, the current study sought to provide suggestions on selecting between two alternative Q-matrix designs when an attribute cannot be measured in isolation and when using maximum likelihood estimation in the presence of missing responses. The factorial ANOVA results of this simulation study indicate that adding items measuring some attributes instead of all attributes is more optimal, and that other missing data treatments should be sought if the percentage of missing responses is greater than 5%.
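To illustrate the kind of choice the study addresses, below is a made-up contrast between two Q-matrix designs for a two-attribute assessment in which the second attribute is never measured in isolation (rows are items, columns are attributes, 1 means the item measures the attribute). The matrices are hypothetical and only show how the covered attribute patterns differ between designs.

```python
import numpy as np

# Design A: the added items measure a single attribute where that is possible.
q_single = np.array([
    [1, 0],   # attribute 1 alone
    [1, 0],
    [1, 1],   # attribute 2 appears only together with attribute 1
    [1, 1],
])

# Design B: the added items measure both attributes jointly.
q_joint = np.array([
    [1, 0],
    [1, 1],
    [1, 1],
    [1, 1],
])

# Quick structural check: which attribute-combination patterns does each design cover?
for name, q in [("design A", q_single), ("design B", q_joint)]:
    patterns = sorted({tuple(row) for row in q})
    print(name, "covers item patterns:", patterns)
```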
