21 |
NewsFerret: supporting identity risk identification and analysis through text mining of news stories
Golden, Ryan Christian (18 December 2013)
Individuals, organizations, and devices are now interconnected to an unprecedented degree. This has forced identity risk analysts to redefine what “identity” means in such a context, and to explore new techniques for analyzing an ever-expanding threat context. Major hurdles to modeling in this field include the inherent lack of publicly available data due to privacy and safety concerns, as well as the unstructured nature of incident reports. To address this, this report develops a system for strengthening an identity risk model through text mining of news stories. The system, called NewsFerret, collects and analyzes news stories on the topic of identity theft, establishes semantic relatedness measures between identity concept pairs, and supports analysis of those measures through reports, visualizations, and relevant news stories. Evaluating the resulting analytical models shows where the system is effective in helping the risk analyst expand and validate identity risk models. / text
22 |
Document Clustering with Dual Supervision
Hu, Yeming (19 June 2012)
Nowadays, academic researchers maintain a personal library of papers, which they would like
to organize based on their needs, e.g., research, projects, or courseware. Clustering techniques
are often employed to achieve this goal by grouping the document collection into different
topics. Unsupervised clustering does not require any user effort but only produces one universal
output with which users may not be satisfied. Therefore, document clustering needs user input
for guidance to generate personalized clusters for different users. Semi-supervised clustering
incorporates prior information and has the potential to produce customized clusters. Traditional
semi-supervised clustering is based on user supervision in the form of labeled instances or
pairwise instance constraints. However, alternative forms of user supervision exist such as
labeling features. For document clustering, document supervision involves labeling documents
while feature supervision involves labeling features. Their joint use has been called dual
supervision. In this thesis, we first explore and propose a framework to use feature supervision
for interactive feature selection by indicating whether a feature is useful for clustering.
Second, we enhance the semi-supervised clustering with feature supervision using feature
reweighting. Third, we propose a unified framework to combine document supervision and
feature supervision through seeding. The newly proposed algorithms are evaluated using oracles
and demonstrated to be more helpful in producing better clusters matching a single user's point
of view than document clustering without any supervision and with only document supervision.
Finally, we conduct a user study to confirm that different users have different understandings of
the same document collection and prefer personalized clusters. At the same time, we demonstrate
that document clustering with dual supervision is able to produce good personalized clusters
even with noisy user input. Dual supervision is also demonstrated to be more effective in
personalized clustering than no supervision or any single supervision. We also analyze users'
behaviors during the user study and present suggestions for the design of document management
23 |
Modelling Deception Detection in Text
Gupta, Smita (29 November 2007)
As organizations and government agencies work diligently to detect financial irregularities, malfeasance, fraud and criminal activities through intercepted communication, there is an increasing interest in devising an automated model/tool for deception detection. We build on Pennebaker's empirical model, which suggests that deception in text leaves a linguistic signature characterised by changes in frequency of four categories of words: first-person pronouns, exclusive words, negative emotion words, and action words. By applying the model to the Enron email dataset and using an unsupervised matrix-decomposition technique, we explore the differential use of these cue-words/categories in deception detection. Instead of focusing on the predictive power of the individual cue-words, we construct a descriptive model which helps us to understand the multivariate profile of deception based on several linguistic dimensions and highlights the qualitative differences between deceptive and truthful communication. This descriptive model can not only help detect unusual and deceptive communication, but also possibly rank messages along a scale of relative deceptiveness (for instance from strategic negotiation and spin to deception and blatant lying). The model is unintrusive, requires minimal human intervention and, by following the defined pre-processing steps, may be applied to new datasets from different domains. / Thesis (Master, Computing) -- Queen's University, 2007-11-28 18:10:30.45
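The idea of profiling messages by cue-word categories and then decomposing the resulting matrix can be sketched roughly as follows. This is only an illustration: the abstract does not name the decomposition it uses, so singular value decomposition (SVD) is assumed here, and the word lists in `CATEGORIES` are hypothetical, heavily abbreviated stand-ins for real lexicons. The function names are likewise invented for this sketch.

```python
import re
import numpy as np

# Hypothetical, heavily abbreviated word lists for the four cue categories;
# a real study would rely on much fuller lexicons.
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "except", "without", "however"},
    "negative_emotion": {"hate", "worthless", "angry", "afraid"},
    "action": {"go", "went", "carry", "run", "move"},
}


def category_rates(message):
    """Return the relative frequency of each cue category in one message."""
    tokens = re.findall(r"[a-z']+", message.lower())
    total = max(len(tokens), 1)  # avoid division by zero on empty messages
    return [sum(t in words for t in tokens) / total
            for words in CATEGORIES.values()]


def deception_profile(messages):
    """Decompose the message-by-category matrix with SVD (an assumed choice).

    Returns the message coordinates along the singular directions and the
    directions themselves (rows of Vt), expressed in category space.
    """
    X = np.array([category_rates(m) for m in messages])
    X = X - X.mean(axis=0)  # centre each category column
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U * S, Vt
```

Messages that lie far from the bulk of the others in the leading coordinates would be the candidates for closer inspection; how the thesis actually ranks messages is not specified in the abstract.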
24 |
Text Analysis Support by HK Graph Using Concepts
FURUHASHI, Takeshi, YOSHIKAWA, Tomohiro, KOBAYASHI, Daisuke, 古橋, 武, 吉川, 大弘, 小林, 大輔 (29 March 2012)
No description available.
25 |
Application of the Recommendation Architecture Model for Text Mining
Hemali Uditha Wijewardane Ratnayake, Udithaw@ou.ac.lk (January 2004)
The Recommendation Architecture (RA) model is a new connectionist approach simulating some aspects of the human brain. Applying the RA to a real-world problem is a novel research problem that has not previously been addressed in the literature. Research conducted with simulated data has shown much promise for the RA model's ability in pattern discovery and pattern recognition. This thesis investigates the application of the RA model to text mining, where pattern discovery and recognition play an important role.
The clustering system of the RA model is examined in detail, and a formal notation for representing the fundamental components and algorithms is proposed for clarity of understanding. A software simulation of the clustering system of the RA model is built for empirical studies. To argue that the RA model is applicable to text mining, the following aspects of the model are examined. With its pattern recognition ability, the clustering system of the RA is adapted for text classification and text organization. As the core of the RA model is concerned with pattern discovery, or the identification of associative similarities in input, it is also used to discover unsuspected relationships within the content of documents. How the RA model can be applied to the problems of pattern discovery in text and classification of text is addressed, with results from a series of experiments. The difficulties in applying the RA model to real-life data are described, and several extensions to the RA model for optimal performance are proposed from the insights obtained from experiments. Furthermore, the RA model can be extended to provide user-friendly interpretation of results. This research shows that, with the proposed extensions, the RA model can be successfully applied to the problem of text mining to a large extent. Some limitations exist when the RA model is applied to very noisy data, which are also demonstrated here.
26 |
Automated Methods for the Thematic Analysis of News-Oriented Text Sources
Niekler, Andreas (20 January 2016)
Thematic analysis is an important component of content analysis in media studies. For analyzing the thematic structure of large digital text collections, it is therefore important to investigate the potential of automated, computer-assisted methods. In doing so, the methodological and analytical requirements of content analysis, which also apply to thematic analysis, must be observed and reflected. This thesis examines the possibilities for automating thematic analysis and its prospective applications. It draws on the theoretical and methodological foundations of content analysis and on linguistic theories of thematic structure to derive requirements for automatic analysis. The main contribution is an investigation of the potential and the tools of data mining and text mining that can be used helpfully and profitably for content-analytic work in text databases. In addition, an exemplary analysis is carried out to show the applicability of automatic methods to thematic analysis. The thesis also demonstrates possible uses of interactive interfaces, formulates the idea and implementation of suitable software, and shows the application of a possible workflow for thematic analysis. By presenting the potential of automated thematic investigation in large digital text collections, this thesis contributes to research on automated content analysis.
Starting from the requirements placed on a thematic analysis, this thesis shows which text mining methods and automatisms can come close to meeting them. In summary, two requirements stand out, and fulfilling one influences the other. First, a rapid thematic survey of the topics in a complex document collection is required in order to map its content structure and to contrast topics. Second, the topics must be representable at a sufficient level of detail so that the sense and meaning of the topic contents can be analyzed. Both approaches are methodologically anchored in the quantitative and qualitative approaches of content analysis. The thesis discusses these parallels and relates automatic procedures and algorithms to the requirements. Methods can be identified that allow a semantic, and thus thematic, separation of the data and provide an abstracted overview of large document sets. These are methods such as topic models or clustering procedures. With the help of these algorithms it is possible to produce thematically coherent subsets of a document collection and to make their thematic content available for summaries. It is shown that the topics remain distinguishable despite the distant view and that their frequencies and distributions in a text collection can be displayed diachronically. This preparation of the data allows the analysis of thematic trends or the selection of particular thematic aspects from a wealth of documents. Diachronic examination of thematically coherent document sets thereby becomes possible, and the frequencies of topics over time can be analyzed. For the detailed interpretation and summarization of topics, further representations and information must be derived from the topic contents. It can be shown that meanings, statements, and contexts can be made visible through a co-occurrence analysis of documents belonging to a topic. In a form of application that takes reading direction and parts of speech into account, frequently occurring word sequences or statements within a thematization can be captured statistically. The phrases generated in this way can be used to define categories or be contrasted with other topics, publications, or theoretical assumptions. In addition, diachronic analyses of individual words, word groups, or proper names within a topic are suitable for identifying topic phases, key terms, or news factors. The information obtained in this way can be supplemented by a “close reading” of thematically relevant documents, which the thematic separation of the document sets makes possible. Beyond these methodological perspectives, the automated analyses can be used as empirical measuring instruments in the context of further communication-science theories not discussed here. Furthermore, the thesis shows that graphical interfaces and software frameworks for carrying out automated thematic analyses are feasible and practical to use. In this respect, the discussion shows how the solutions and approaches presented can be transferred into practice.
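A co-occurrence analysis of the kind mentioned above can be sketched in a few lines. This is a simplified illustration, not the procedure of the thesis: it assumes sentences are separated by full stops, that words are separated by whitespace, and that two words co-occur when they appear in the same sentence. The function name `cooccurrence_counts` is invented for this sketch.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often two words appear together in the same sentence.

    `documents` is a list of plain-text strings belonging to one topic.
    Each unordered word pair is stored as an alphabetically sorted tuple.
    """
    counts = Counter()
    for doc in documents:
        for sentence in doc.split("."):  # naive sentence splitting
            words = sorted(set(sentence.lower().split()))
            counts.update(combinations(words, 2))
    return counts
```

Calling `cooccurrence_counts(docs).most_common(10)` on the documents of one topic would list the ten most frequent pairs; a real analysis would normally also filter stop words and apply a significance measure rather than raw counts.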
The thesis makes substantial contributions to research on automated content analysis. Above all, it documents the scholarly engagement with automated thematic analyses. While working on this topic, the author developed suitable procedures for using text mining methods in practice for content analyses. Among other things, contributions were made to the visualization and easy use of various methods. Methods from topic modelling, clustering, and co-occurrence analysis had to be adapted so that they can be used in content-analytic applications. Further contributions arose from the methodological classification of computer-assisted thematic analysis and from the definition of innovative applications in this area. The experiments and investigations carried out for this thesis were performed entirely in specially developed software, which is also used successfully in other projects. Around this system, processing chains, data storage, visualization, graphical interfaces, means of data interaction, machine learning methods, and components for document retrieval were implemented. As a result, the complex methods and procedures for automatic thematic analysis become easy to apply and are available in a user-friendly form for future projects and analyses. Social scientists, political scientists, or communication scientists can work with the software environment and carry out content analyses without having to master the details of the automation and computer support.
27 |
Applying computer-assisted assessment to auto-generating feedback on project proposals
Al-Yazeedi, Fatema (January 2016)
Through different learning portals, computer-assisted assessment (CAA) tools have improved considerably over the past few decades. In a CAA community, these tools are categorised into types of questions, types of testing, and types of assessment. Most of these provide the assessment of multiple-choice questions, true and false questions, or matching questions. Other CAA tools evaluate short and long essay questions, each of which uses different grading methods and techniques in terms of style and content. However, due to the complexity involved in analysing free-text writing, whether accurate, easy-to-use, and effective tools can be developed and evaluated remains an open question. This research proposes a new contextual framework as a novel approach to the investigation of a new CAA tool which auto-generates feedback on project proposals. This research follows a Design Science Research paradigm to achieve and evaluate the accuracy, ease of use, and effectiveness of the new tool in the computer science domain in higher education institutes. This is achieved in three interrelated cycles: (1) based on the existing literature on this topic and an exploratory study on the currently available approaches to the provision of feedback on final year project proposals, a proposed framework to auto-generate feedback on any electronically submitted coursework is constructed in order to gain a clear understanding of how such a CAA tool might work; (2) a contextual framework based on the proposed framework for final year project proposals is constructed by considering both the style and content of the free text and using different text mining techniques; and (3) the accuracy, ease of use, and effectiveness of the implemented web-based CAA application named Feedback Automated Tool (FEAT) are evaluated based on the contextual framework.
This research applies CAA and text mining techniques to identify and model the key elements of the framework and its components in order to enable the development and evaluation of a novel CAA contextual framework which can be utilised for auto-generating accurate, easy to use, and effective feedback on final year project proposals.
28 |
Spherical k-Means Clustering
Buchta, Christian, Kober, Martin, Feinerer, Ingo, Hornik, Kurt (09 1900)
Clustering text documents is a fundamental task in modern data analysis, requiring
approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing
cosine dissimilarities to perform prototype-based partitioning of term weight representations
of the documents.
This paper presents the theory underlying the standard spherical k-means problem
and suitable extensions, and introduces the R extension package skmeans which provides
a computational environment for spherical k-means clustering featuring several solvers:
a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and
Gmeans). Performance of these solvers is investigated by means of a large scale benchmark
experiment. (authors' abstract)
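The fixed-point iteration for the standard spherical k-means problem can be illustrated with a short NumPy sketch. This is not the skmeans package (which is written for R) and makes several simplifying assumptions: rows of `X` are non-zero term-weight vectors, prototypes are initialised from randomly chosen documents, and a fixed number of iterations stands in for a convergence check. The function name and parameters are chosen for this example only.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Partition the rows of X into k clusters by cosine similarity.

    Rows are projected onto the unit sphere, so the dot product of a
    document with a prototype equals their cosine similarity, and the
    cosine dissimilarity is one minus that value.
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes non-zero rows
    prototypes = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: most similar prototype for every document.
        labels = np.argmax(X @ prototypes.T, axis=1)
        # Update step: prototype becomes the normalised sum of its members.
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old prototype if the cluster is empty
                prototypes[j] = total / norm
    labels = np.argmax(X @ prototypes.T, axis=1)
    return labels, prototypes
```

Because only the direction of each vector matters, long and short documents with the same term proportions end up in the same cluster, which is the usual motivation for preferring cosine dissimilarity to Euclidean distance on text.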
29 |
Intertextual Readings of the Nyāyabhūṣaṇa on Buddhist Anti-Realism
Neill, Tyler (13 December 2022)
This two-part dissertation has two goals: 1) a close philological reading of a 50-page section of a 10th-century Sanskrit philosophical work (Bhāsarvajña's Nyāyabhūṣaṇa), and 2) the creation and assessment of a novel intertextuality research system (Vātāyana) centered on the same work.
The first half of the dissertation encompasses the philology project in four chapters: 1) background on the author, work, and key philosophical ideas in the passage; 2) descriptions of all known manuscript witnesses of this work and a new critical edition that substantially improves upon the editio princeps; 3) a word-for-word English translation richly annotated with both traditional explanatory material and novel digital links to not one but two interactive online research systems; and 4) a discussion of the Sanskrit author's dialectical strategy in the studied passage.
The second half of the dissertation details the intertextuality research system in a further four chapters: 5) why it is needed and what can be learned from existing projects; 6) the creation of the system consisting of curated textual corpus, composite algorithm in natural language processing and information retrieval, and live web-app interface; 7) an evaluation of system performance measured against a small gold-standard dataset derived from traditional philological research; and 8) a discussion of the impact such new technology could have on humanistic research more broadly. System performance was assessed to be quite good, with a 'recall@5' of 80%, meaning that most previously known cases of mid-length quotation and even paraphrase could be automatically found and returned within the system's top five hits. Moreover, the system was also found to return a 34% surplus of additional significant parallels not found in the small benchmark. This assessment confirms that Vātāyana can be useful to researchers by aiding them in their collection and organization of intertextual observations, leaving them more time to focus on interpretation.
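As a rough illustration of how a figure such as 'recall@5' can be computed, the sketch below counts a benchmark passage as found when at least one of its known parallels appears among the system's top five hits. That reading of the metric, the function name `recall_at_k`, and the dictionary layout of the inputs are assumptions made for this example; the dissertation's exact definition may differ.

```python
def recall_at_k(ranked_hits, relevant, k=5):
    """Share of benchmark queries with a known parallel in the top k hits.

    ranked_hits: dict mapping a query id to its ranked list of hit ids.
    relevant:    dict mapping a query id to the set of known parallel ids.
    """
    found = sum(
        1
        for query, gold in relevant.items()
        if gold & set(ranked_hits.get(query, [])[:k])
    )
    return found / len(relevant)
```

With five benchmark passages of which four have a known parallel in the top five hits, the function returns 0.8, matching the form of the 80% figure reported above.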
Seventeen appendices illustrate both these efforts and a number of side projects, the latter of which span translation alignment, network visualization of an important database of South Asian prosopography (PANDiT), and a multi-functional Sanskrit text-processing web application (Skrutable).
Preface
Table of Contents
Abbreviations
Terms and Symbols
Nyāyabhūṣaṇa Witnesses
Main Sanskrit Editions
Introduction
A Multi-Disciplinary Project in Intertextual Reading
Main Object of Study: Nyāyabhūṣaṇa 104–154
Project Outline
Part I: Close Reading
1 Background
1.1 Bhāsarvajña
1.2 The Nyāyabhūṣaṇa
1.2.1 As One of Several Commentaries on Bhāsarvajña's Nyāyasāra
1.2.2 In Modern Scholarship, with Focus on NBhū 104–154
1.3 Philosophical Context
1.3.1 Key Philosophical Concepts
1.3.2 Intra-Textual Context within the Nyāyabhūṣaṇa
1.3.3 Inter-Textual Context
2 Edition of NBhū 104–154
2.1 Source Materials
2.1.1 Edition of Yogīndrānanda 1968 (E)
2.1.2 Manuscripts (P1, P2, V)
2.1.3 Diplomatic Transcripts
2.2 Notes on Using the Edition
2.3 Critical Edition of NBhū 104–154 with Apparatuses
3 Translation of NBhū 104–154
3.1 Notes on Translation Method
3.2 Notes on Outline Headings
3.3 Annotated Translation of NBhū 104–154
4 Discussion
4.1 Internal Structure of NBhū 104–154
4.2 Critical Assessment of Bhāsarvajña's Argumentation
Part II: Distant Reading with Digital Humanities
5 Background in Intertextuality Detection
5.1 Sanskrit Projects
5.2 Non-Sanskrit Projects
5.3 Operationalizing Intertextuality
6 Building an Intertextuality Machine
6.1 Corpus (Pramāṇa NLP)
6.2 Algorithm (Vātāyana)
6.3 User Interface (Vātāyana)
7 Evaluating System Performance
7.1 Previous Scholarship on NBhū 104–154 as Philological Benchmark
7.2 System Performance Relative to Benchmark
8 Discussion
Conclusion
Works Cited
Main Sanskrit Editions
Works Cited in Part I
Works Cited in Part II
Appendices
Appendix 1: Correspondence of Joshi 1986 to Yogīndrānanda 1968
Appendix 1D: Full-Text Alignment of Joshi 1986 to Yogīndrānanda 1968
Appendix 2: Prosopographical Relations Important for NBhū 104–154
Appendix 2D: Command-Line Tool “Pandit Grapher”
Appendix 3: Previous Suggestions to Improve Text of NBhū 104–154
Appendix 4D: Transcript and Collation Data for NBhū 104–154
Appendix 5D: Command-Line Tool “cte2cex” for Transcript Data Conversion
Appendix 6D: Deployment of Brucheion for Interactive Transcript Data
Appendix 7: Highlighted Improvements to Text of NBhū 104–154
Appendix 7D: Alternate Version of Edition With Highlighted Improvements
Appendix 8D: Digital Forms of Translation of NBhū 104–154
Appendix 9: Analytic Outline of NBhū 104–154 by Shodo Yamakami
Appendix 10.1: New Analytic Outline of NBhū 104–154 (Overall)
Appendix 10.2: New Analytic Outline of NBhū 104–154 (Detailed)
Appendix 11D: Skrutable Text Processing Library and Web Application
Appendix 12D: Pramāṇa NLP Corpus, Metadata, and LDA Modeling Info
Appendix 13D: Vātāyana Intertextuality Research Web Application
Appendix 14: Sample of Yamakami Citation Benchmark for NBhū 104–154
Appendix 14D: Full Yamakami Citation Benchmark for NBhū 104–154
Appendix 15: Vātāyana Recall@5 Scores for NBhū 104–154
Appendix 16: PVA, PVin, and PVSV Vātāyana Search Hits for Entire NBhū
Appendix 17: Sample Listing of Vātāyana Search Hits for Entire NBhū
Appendix 17D: Full Listing of Vātāyana Search Hits for Entire NBhū
Overview of Digital Appendices
Zusammenfassung (Thesen zur Dissertation)
Summary of Results
30 |
Using Dependency Parses to Augment Feature Construction for Text Mining
Guo, Sheng (18 June 2012)
With the prevalence of large data stored in the cloud, including unstructured information in the form of text, there is now an increased emphasis on text mining. A broad range of techniques are now used for text mining, including algorithms adapted from machine learning, NLP, computational linguistics, and data mining. Applications are also multi-fold, including classification, clustering, segmentation, relationship discovery, and practically any task that discovers latent information from written natural language.
Classical mining algorithms have traditionally focused on shallow representations such as bag-of-words and similar feature-based models. With the advent of modern high performance computing, deep sentence level linguistic analysis of large scale text corpora has become practical. In this dissertation, we evaluate the utility of dependency parses as textual features for different text mining applications. Dependency parsing is one form of syntactic parsing, based on the dependency grammar implicit in sentences. While dependency parsing has traditionally been used for text understanding, we investigate here its application to supply features for text mining applications.
We specifically focus on three methods to construct textual features from dependency parses. First, we consider a dependency parse as a general feature akin to a traditional bag-of-words model. Second, we consider the dependency parse as the basis to build a feature graph representation. Finally, we use dependency parses in a supervised collocation mining method for feature selection. To investigate these three methods, several applications are studied, including: (i) movie spoiler detection, (ii) text segmentation, (iii) query expansion, and (iv) recommender systems. / Ph. D.
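The first of the three methods, treating a dependency parse as a feature bag akin to bag-of-words, might look roughly like the sketch below. To stay self-contained it takes an already parsed sentence as a list of (head, relation, dependent) triples instead of calling a parser; the triple format, the relation labels in the usage note, and the function name are assumptions for illustration and are not taken from the dissertation.

```python
from collections import Counter


def dependency_features(parse):
    """Turn (head, relation, dependent) triples into a bag of features.

    Each triple becomes one string feature such as "nsubj(ruins,twist)",
    so a sentence is represented by counts of its dependency relations
    in the same way a bag-of-words model counts its words.
    """
    return Counter(
        f"{rel}({head.lower()},{dep.lower()})" for head, rel, dep in parse
    )
```

For a hand-written parse of "The plot twist ruins the ending", for example `[("ruins", "nsubj", "twist"), ("twist", "compound", "plot"), ("twist", "det", "The"), ("ruins", "obj", "ending"), ("ending", "det", "the")]`, the result contains five distinct features. Unlike plain word counts, these keep track of which word governs which, and they can be fed to a classifier together with ordinary bag-of-words counts.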