401

An Unsupervised Approach to Detecting and Correcting Errors in Text

Islam, Md Aminul 01 June 2011 (has links)
In practice, most approaches to text error detection and correction rely on a conventional, domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language; as a result, satisfactory correction can be achieved only if the dictionary covers most tokens of the underlying correct text. Moreover, most text-correction approaches handle only one, or at best a very few, types of errors. The purpose of this thesis is to propose an unsupervised approach to detecting and correcting text errors that can compete with supervised approaches, and to answer the following questions: Can an unsupervised approach efficiently detect and correct a text containing multiple errors of both a syntactic and a semantic nature? What is the magnitude of error coverage, in terms of the number of errors that can be corrected? We conclude that (1) an unsupervised approach can efficiently detect and correct a text containing multiple errors of both a syntactic and a semantic nature. Error types include real-word spelling errors, typographical errors, lexical choice errors, unwanted words, missing words, prepositional errors, article errors, punctuation errors, and many grammatical errors (e.g., errors in agreement and verb formation). (2) The magnitude of error coverage, in terms of the number of errors that can be corrected, is almost double the number of correct words in the text. Although this is not the upper limit, it is what is practically feasible. We use engineering approaches to answer the first question and theoretical approaches to answer and support the second. We show that inferring the inherent properties of correct text from a corpus in the form of an n-gram data set is more appropriate and practical than other approaches to detecting and correcting errors. Instead of relying on rule-based approaches and dictionaries, we argue that a corpus can effectively be used to infer the properties of these error types and to detect and correct them. We test the robustness of the proposed approach separately on some individual error types, and then on all types of errors together. The approach is language-independent: it can be applied to other languages as long as n-gram data are available. The results of this thesis thus suggest that unsupervised approaches, which are often dismissed in favor of supervised ones for many Natural Language Processing (NLP) tasks, may present an interesting array of NLP problem-solving strengths.
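The central mechanism described here, ranking candidate words by the corpus frequency of the n-grams they would form with their context, can be illustrated with a minimal sketch. This is not the thesis's actual algorithm: the trigram counts and the candidate list below are invented for the example, standing in for a web-scale n-gram data set.

```python
# Minimal sketch of n-gram-based real-word error correction.
# The trigram counts and candidate sets are invented for illustration;
# a real system would query a large n-gram data set.

TRIGRAM_COUNTS = {
    ("piece", "of", "cake"): 9500,
    ("peace", "of", "cake"): 3,
    ("piece", "of", "mind"): 40,
    ("peace", "of", "mind"): 8700,
}

# Plausible alternatives for confusable words (hypothetical).
CANDIDATES = {
    "peace": ["peace", "piece"],
    "piece": ["piece", "peace"],
}

def correct(tokens, i):
    """Keep tokens[i] or replace it with the candidate whose trigram
    with the two following tokens is most frequent in the corpus."""
    cands = CANDIDATES.get(tokens[i], [tokens[i]])
    return max(
        cands,
        key=lambda c: TRIGRAM_COUNTS.get((c, tokens[i + 1], tokens[i + 2]), 0),
    )

print(correct(["peace", "of", "cake"], 0))  # -> piece
print(correct(["peace", "of", "mind"], 0))  # -> peace
```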
402

Improving Feature Selection Techniques for Machine Learning

Tan, Feng 27 November 2007 (has links)
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant, or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy, and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria, but no single criterion has proved best for all applications. We propose a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features (a wrapper method); we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. Experiments on genomic data demonstrate that ours is a robust and effective approach that can find feature subsets with higher classification accuracy and/or smaller size than each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification over a great number of features, which makes wrapper methods time-consuming and impractical. We therefore propose a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
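The "frequency variance" intuition, that informative terms are distributed very differently across classes, admits a simple reading: score each term by the variance of its per-class document frequency. The sketch below follows that reading on invented toy data; the exact RSFV formula is defined in the thesis, so treat this as an illustration rather than the published measure.

```python
import numpy as np

def frequency_variance(X, y):
    """Score each term by the variance, across classes, of the fraction
    of that class's documents containing the term.  A guess at the
    'frequency variance' component of RSFV, for illustration only."""
    classes = np.unique(y)
    # rates[c, t] = fraction of class-c documents containing term t
    rates = np.stack([(X[y == c] > 0).mean(axis=0) for c in classes])
    return rates.var(axis=0)

# Toy data: 4 documents x 3 terms, two classes.
X = np.array([[2, 0, 1],
              [1, 0, 1],
              [0, 3, 1],
              [0, 2, 1]])
y = np.array([0, 0, 1, 1])
print(frequency_variance(X, y))  # term 2 occurs everywhere -> variance 0
```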
404

An Ensemble Approach for Text Categorization with Positive and Unlabeled Examples

Chen, Hsueh-Ching 29 July 2005 (has links)
Text categorization is the process of assigning new documents to predefined document categories on the basis of a classification model (or models) induced from a set of pre-categorized training documents. In a typical dichotomous classification scenario, the set of training documents includes both positive and negative examples; that is, each of the two categories is associated with training documents. However, in many real-world text categorization applications, positive and unlabeled documents are readily available, whereas acquiring samples of negative documents is extremely expensive or even impossible. In this study, we propose and develop an ensemble approach, referred to as E2, to address the limitations of existing algorithms for learning from positive and unlabeled training documents. Using spam email filtering as the evaluation application, our empirical evaluation results suggest that the proposed E2 technique exhibits more stable and reliable performance than PNB and PEBL.
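One widely used family of positive-and-unlabeled (PU) learning schemes builds an ensemble by repeatedly treating random subsamples of the unlabeled pool as provisional negatives and averaging the resulting classifiers' scores. The sketch below shows that generic PU-bagging idea for orientation only; it is not the E2 construction itself, which is defined in the thesis (PNB and PEBL are the comparison baselines there). It assumes scikit-learn and term-count feature matrices.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB  # assumes scikit-learn

def pu_bagging_scores(P, U, X_test, n_models=10, seed=0):
    """Average positive-class scores of classifiers trained on the
    positives P versus random subsamples of the unlabeled pool U
    that are provisionally treated as negatives."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X_test))
    for _ in range(n_models):
        neg = U[rng.choice(len(U), size=len(P), replace=True)]
        X = np.vstack([P, neg])
        y = np.array([1] * len(P) + [0] * len(neg))
        clf = MultinomialNB().fit(X, y)  # suits term-count features
        scores += clf.predict_proba(X_test)[:, 1]
    return scores / n_models

# Toy term-count matrices: rows are documents, columns are terms.
P = np.array([[3, 0, 1], [2, 1, 0]])             # known positive docs
U = np.array([[0, 2, 2], [1, 0, 3], [2, 1, 1]])  # unlabeled docs
print(pu_bagging_scores(P, U, U))  # rank each unlabeled doc
```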
405

Text Mining: A Burgeoning Quality Improvement Tool

J. Mohammad, Mohammad Alkin Cihad 01 November 2007 (has links) (PDF)
While the amount of textual data available to us is constantly increasing, managing texts by human effort alone is clearly inadequate for the volume and complexity of the information involved. Consequently, the need for automated extraction of useful knowledge from huge amounts of textual data to assist human analysis is apparent. Text mining (TM) is a largely automated technique that aims to discover knowledge from textual data. In this thesis, the notion of text mining and its techniques and applications are presented. In particular, the study provides a definition and overview of concepts in text categorization, including document representation models, weighting schemes, feature selection methods, feature extraction, performance measures, and machine learning techniques. The thesis details the functionality of text mining as a quality improvement tool. It carries out an extensive survey of text mining applications within the service sector and manufacturing industry, and presents two broad experimental studies tackling the potential use of text mining in the hotel industry (comment card analysis) and in automobile manufacturing (miles-per-gallon analysis).
Keywords: Text Mining, Text Categorization, Quality Improvement, Service Sector, Manufacturing Industry.
406

Investigations of Term Expansion on Text Mining Techniques

Yang, Chin-Sheng 02 August 2002 (has links)
Recent advances in computer and network technologies have contributed significantly to global connectivity and caused the amount of online textual documents to grow extremely rapidly. The rapid accumulation of textual documents on the Web or within an organization requires effective document management techniques, ranging from information retrieval and information filtering to text mining. The word mismatch problem represents a challenging issue for document management research. Word mismatch has been extensively investigated in information retrieval (IR) research through term expansion (specifically, query expansion). However, a review of the text mining literature suggests that the word mismatch problem has seldom been addressed by text mining techniques. This thesis therefore investigates the use of term expansion in several text mining techniques, specifically text categorization, document clustering, and event detection. Accordingly, we developed term expansion extensions to these three techniques. The empirical evaluation results showed that term expansion increased categorization effectiveness when correlation coefficient feature selection was employed. With respect to document clustering, the techniques extended with term expansion achieved clustering effectiveness comparable to existing techniques and showed superiority in improving the clustering specificity measure. Finally, the use of term expansion to support event detection degraded detection effectiveness compared to the traditional event detection technique.
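Term expansion in this setting typically means enriching a document's term set with strongly related terms before categorization or clustering, so that documents using different words for the same concept still match. The sketch below shows a minimal co-occurrence-based expansion on invented toy documents; the thesis's actual expansion method may differ.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(docs):
    """Count how often each pair of terms appears in the same document."""
    co = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            co[(a, b)] += 1
    return co

def expand(term, co, k=3):
    """Return the k terms that co-occur most often with `term`."""
    related = Counter()
    for (a, b), n in co.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return [t for t, _ in related.most_common(k)]

docs = [["car", "engine", "fuel"],
        ["car", "engine", "tyre"],
        ["fuel", "price", "oil"]]
co = build_cooccurrence(docs)
print(expand("car", co))  # ['engine', 'fuel', 'tyre']
```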
407

The Swedish translation of concessive conjuncts in Dan Brown’s Angels and Demons

Poltan, Andreas January 2007 (has links)
The purpose of this study is to present and analyze the translation of seven selected concessive conjuncts – anyway, however, although, though, still, nonetheless and yet – in Dan Brown's novel Angels and Demons, translated by Ola Klingberg, by means of a comparative method combined with a qualitative analysis. The background and theory are mainly based on Altenberg (1999, 2002) for the conjuncts and Ingo (1991) for translation strategies. The aim is fulfilled by answering three research questions: 1. How does Klingberg translate the seven selected concessive conjuncts into Swedish? 2. What factors influence the choice of translation alternative? 3. What kinds of strategies does Klingberg use? The main result is that the conjuncts are translated into many different alternatives, most frequently the Swedish adversative men, followed by a Swedish concessive such as ändå. However, the analysis of anyway is inconclusive because there were not enough tokens. The main conclusion is that translation is a difficult area to work in, since numerous aspects affect the choice of translation alternative, even though it is shown that it is definitely possible to translate more or less 'correctly'. A second conclusion is that some words are more likely than others to be translated with a particular word.
408

Nutzen und Benutzen von Text Mining für die Medienanalyse / The Benefits and Use of Text Mining for Media Analysis

Richter, Matthias 26 January 2011 (has links) (PDF)
On the one hand, existing findings from fields as diverse as empirical media research and text mining are brought together. The subject is content analysis, whether performed by hand, with computer support, or fully automatically, with particular attention to factors such as time, development, and change. This condensation and compilation not only provides an overview from an unfamiliar perspective; in the process, something new is also synthesized. The underlying thesis remains an inclusive one throughout: just as it seems unlikely that computers can or will ever conduct analyses entirely without human interpretation, human interpreters will no longer be able to treat complex topics promptly, comprehensively, and without undue subjective influence unless they have the best available computer support, and substantively valuable analyses will no longer be able to afford to forgo such aids and quality-assurance instruments. This immediately gives rise to requirements: it must be clarified where the strengths and weaknesses of human analysts and of computational methods lie. Building on that, an optimal synthesis of the strengths of both sides should be achieved while minimizing their respective weaknesses. The practical goal is ultimately to reduce complexity and to open a way out of the system-induced state of being "overnewsed but uninformed".
409

Saco-SR-konflikten 1971 – en analys av opinionsbildning i tidningsledare / The Saco-SR Conflict of 1971 – An Analysis of Influencing Opinion in Newspaper Leaders

Hellström, Gunilla January 2011 (has links)
The aim of this thesis is to study the means used in newspaper leaders (editorials) to influence public opinion. In order to obtain a wide range of such means, I have chosen material with a clear timeframe that illustrates strong political antagonism: the 1971 conflict between the Saco and SR unions and the Swedish state. Leaders from eight newspapers with different party affiliations are analysed, six morning and two evening newspapers. What types of message the leaders convey is examined mainly at the sentence level. Writers report what happened, assess the situation, and analyse the causes of and explanations for the labour conflict. They express criticism of those involved in various ways and exhort them to take recommended courses of action to resolve the conflict. Paragraphs can also be categorised in this way. How criticism is expressed is studied in detail because the material is rich in critical utterances of different types. Various theories about text types, together with speech act theory, provide a theoretical background that is applied to the material. A number of theories about what defines a genre are presented and tested on the leaders. The results of the investigation indicate that a large number of leaders from the morning newspapers are structured in a similar way, with the paragraph as the unit. They reveal a pattern, the normal pattern, in which information is presented in a given order in the majority of morning leaders and the greatest number of message types is used. There is also a pattern of analysis/criticism, with critical and analytical paragraphs alternating and the analysis, as a rule, substantiating the criticism. The few morning-newspaper leaders that do not follow a pattern may be strongly critical or almost solely analytical. One of the morning newspapers has many critical leaders that argue or incite. Evening-newspaper leaders are not analysed at the paragraph level, since their paragraphs are short; instead they are analysed as wholes, as are the argumentative leaders. The analysis shows that many leaders are structured in a similar way, while at the same time there is considerable variation in the material, attributable to there being different types of editorials.
410

The Role of the Interruption in Young Adult Epistolary Novels

Herzhauser, Betty J. 01 January 2015 (has links)
Within the genre of young adult literature, a growing trend is the use of epistolary messages exchanged electronically between characters. These messages are set apart from the formal text of the narrative, creating a break in the text features and layout of the page. Epistolary texts require a more sophisticated reading method and level of interpretation because the epistolary style blends multiple voices and points of view into the plot, creating complicated narration. The reader must navigate the narrator's path in order to extract meaning from the text. In this hermeneutic study, I examined the text structures of three young adult novels that contain epistolary excerpts. I used ethnographic content analysis (Altheide, 1987) to isolate, analyze, and then contextualize the different epistolary moments within the narrative of each novel. The study was guided by two research questions: 1. What types of text structures and features did the authors of selected young adult literature with epistolary interruptions, published since 2008, use across the body of the published work? 2. How did the authors situate the different text structures of interruption in the flow of the narrative, and what happens after the interruption? I used a coding system that I developed from a case study of the novel Falling for Hamlet by Michelle Ray (2011). Through my analysis I found that the authors used specific verbs to announce an interruption. The interruptions, though few in number, require readers to consider the context of the message (event, setting, speaker, purpose, and tone) as it relates both to the message itself and to the arc of the plot. Following an interruption, the reader must also decide how to incorporate it into the narrative: as adding to the conflict, adding detail, ending a scene, or simply returning to the narrative. The interruptions in epistolary young adult novels thus incorporate the textual and literacy practices of young adults. Such incorporation reflects changes in literacy practices in the early 21st century that may make novels of this style challenging for readers to create meaning from. The study further draws on Bakhtin's theory of heteroglossia (1980), according to which a novel contains not a single language but a plurality of languages within a single langue, and on Dresang's Theory of Radical Change (1999) of connectivity, interactivity, and access. Texts of this nature offer teachers of reading opportunities to guide students through text features to synthesize information in fiction and non-fiction texts.
