1 |
Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)Razavi, Amir Hossein 16 July 2012 (has links)
In this dissertation, we introduce a novel text representation method mainly used for text classification purpose. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike the traditional representation methods, in which all the documents are represented solely based on the constituent words of the documents, and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data interpretation in a more conceptual space, rather than only containing scattered words occurring in texts. It aims to perform the extraction of the knowledge tailored for the classification task by automatic creation of a lightweight ontological hierarchy of representations. In the last step, we will train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task, over a stack of representations and granular conceptual elements of its different levels (hierarchical mapping instead of linear mapping over a vector). Finally the entire algorithm is applied to a variety of general text classification tasks, and the performance is evaluated in comparison with well-known algorithms.
|
2 |
Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)Razavi, Amir Hossein 16 July 2012 (has links)
In this dissertation, we introduce a novel text representation method mainly used for text classification purpose. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike the traditional representation methods, in which all the documents are represented solely based on the constituent words of the documents, and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data interpretation in a more conceptual space, rather than only containing scattered words occurring in texts. It aims to perform the extraction of the knowledge tailored for the classification task by automatic creation of a lightweight ontological hierarchy of representations. In the last step, we will train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task, over a stack of representations and granular conceptual elements of its different levels (hierarchical mapping instead of linear mapping over a vector). Finally the entire algorithm is applied to a variety of general text classification tasks, and the performance is evaluated in comparison with well-known algorithms.
|
3 |
Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)Razavi, Amir Hossein January 2012 (has links)
In this dissertation, we introduce a novel text representation method mainly used for text classification purpose. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike the traditional representation methods, in which all the documents are represented solely based on the constituent words of the documents, and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data interpretation in a more conceptual space, rather than only containing scattered words occurring in texts. It aims to perform the extraction of the knowledge tailored for the classification task by automatic creation of a lightweight ontological hierarchy of representations. In the last step, we will train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task, over a stack of representations and granular conceptual elements of its different levels (hierarchical mapping instead of linear mapping over a vector). Finally the entire algorithm is applied to a variety of general text classification tasks, and the performance is evaluated in comparison with well-known algorithms.
|
4 |
Aspectos semânticos na representação de textos para classificação automática / Semantic aspects in the representation of texts for automatic classificationSinoara, Roberta Akemi 24 May 2018 (has links)
Dada a grande quantidade e diversidade de dados textuais sendo criados diariamente, as aplicações do processo de Mineração de Textos são inúmeras e variadas. Nesse processo, a qualidade da solução final depende, em parte, do modelo de representação de textos adotado. Por se tratar de textos em língua natural, relações sintáticas e semânticas influenciam o seu significado. No entanto, modelos tradicionais de representação de textos se limitam às palavras, não sendo possível diferenciar documentos que possuem o mesmo vocabulário, mas que apresentam visões diferentes sobre um mesmo assunto. Nesse contexto, este trabalho foi motivado pela diversidade das aplicações da tarefa de classificação automática de textos, pelo potencial das representações no modelo espaço-vetorial e pela lacuna referente ao tratamento da semântica inerente aos dados em língua natural. O seu desenvolvimento teve o propósito geral de avançar as pesquisas da área de Mineração de Textos em relação à incorporação de aspectos semânticos na representação de coleções de documentos. Um mapeamento sistemático da literatura da área foi realizado e os problemas de classificação foram categorizados em relação à complexidade semântica envolvida. Aspectos semânticos foram abordados com a proposta, bem como o desenvolvimento e a avaliação de sete modelos de representação de textos: (i) gBoED, modelo que incorpora a semântica obtida por meio de conhecimento do domínio; (ii) Uni-based, modelo que incorpora a semântica por meio da desambiguação lexical de sentidos e hiperônimos de conceitos; (iii) SR-based Terms e SR-based Sentences, modelos que incorporam a semântica por meio de anotações de papéis semânticos; (iv) NASARIdocs, Babel2Vec e NASARI+Babel2Vec, modelos que incorporam a semântica por meio de desambiguação lexical de sentidos e embeddings de palavras e conceitos. Representações de coleções de documentos geradas com os modelos propostos e outros da literatura foram analisadas e avaliadas na classificação automática de textos, considerando datasets de diferentes níveis de complexidade semântica. As propostas gBoED, Uni-based, SR-based Terms e SR-based Sentences apresentam atributos mais expressivos e possibilitam uma melhor interpretação da representação dos documentos. Já as propostas NASARIdocs, Babel2Vec e NASARI+Babel2Vec incorporam, de maneira latente, a semântica obtida de embeddings geradas a partir de uma grande quantidade de documentos externos. Essa propriedade tem um impacto positivo na performance de classificação. / Text Mining applications are numerous and varied since a huge amount of textual data are created daily. The quality of the final solution of a Text Mining process depends, among other factors, on the adopted text representation model. Despite the fact that syntactic and semantic relations influence natural language meaning, traditional text representation models are limited to words. The use of such models does not allow the differentiation of documents that use the same vocabulary but present different ideas about the same subject. The motivation of this work relies on the diversity of text classification applications, the potential of vector space model representations and the challenge of dealing with text semantics. Having the general purpose of advance the field of semantic representation of documents, we first conducted a systematic mapping study of semantics-concerned Text Mining studies and we categorized classification problems according to their semantic complexity. Then, we approached semantic aspects of texts through the proposal, analysis, and evaluation of seven text representation models: (i) gBoED, which incorporates text semantics by the use of domain expressions; (ii) Uni-based, which takes advantage of word sense disambiguation and hypernym relations; (iii) SR-based Terms and SR-based Sentences, which make use of semantic role labels; (iv) NASARIdocs, Babel2Vec and NASARI+Babel2Vec, which take advantage of word sense disambiguation and embeddings of words and senses.We analyzed the expressiveness and interpretability of the proposed text representation models and evaluated their classification performance against different literature models. While the proposed models gBoED, Uni-based, SR-based Terms and SR-based Sentences have improved expressiveness, the proposals NASARIdocs, Babel2Vec and NASARI+Babel2Vec are latently enriched by the embeddings semantics, obtained from the large training corpus. This property has a positive impact on text classification performance.
|
5 |
A Semantic Graph Model for Text Representation and Matching in Document MiningShaban, Khaled January 2006 (has links)
The explosive growth in the number of documents produced daily necessitates the development of effective alternatives to explore, analyze, and discover knowledge from documents. Document mining research work has emerged to devise automated means to discover and analyze useful information from documents. This work has been mainly concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and utilizing that in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction. <br /><br /> Conventional text representation methodologies consider documents as bags of words and ignore the meanings and ideas their authors want to convey. It is this deficiency that causes similarity measures to fail to perceive contextual similarity of text passages due to the variation of the words the passages contain, or at least perceive contextually dissimilar text passages as being similar because of the resemblance of words the passages have. <br /><br /> This thesis presents a new paradigm for mining documents by exploiting semantic information of their texts. A formal semantic representation of linguistic inputs is introduced and utilized to build a semantic representation scheme for documents. The representation scheme is constructed through accumulation of syntactic and semantic analysis outputs. A new distance measure is developed to determine the similarities between contents of documents. The measure is based on inexact matching of attributed trees. It involves the computation of all distinct similarity common sub-trees, and can be computed efficiently. It is believed that the proposed representation scheme along with the proposed similarity measure will enable more effective document mining processes. <br /><br /> The proposed techniques to mine documents were implemented as vital components in a mining system. A case study of semantic document clustering is presented to demonstrate the working and the efficacy of the framework. Experimental work is reported, and its results are presented and analyzed.
|
6 |
A Semantic Graph Model for Text Representation and Matching in Document MiningShaban, Khaled January 2006 (has links)
The explosive growth in the number of documents produced daily necessitates the development of effective alternatives to explore, analyze, and discover knowledge from documents. Document mining research work has emerged to devise automated means to discover and analyze useful information from documents. This work has been mainly concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and utilizing that in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction. <br /><br /> Conventional text representation methodologies consider documents as bags of words and ignore the meanings and ideas their authors want to convey. It is this deficiency that causes similarity measures to fail to perceive contextual similarity of text passages due to the variation of the words the passages contain, or at least perceive contextually dissimilar text passages as being similar because of the resemblance of words the passages have. <br /><br /> This thesis presents a new paradigm for mining documents by exploiting semantic information of their texts. A formal semantic representation of linguistic inputs is introduced and utilized to build a semantic representation scheme for documents. The representation scheme is constructed through accumulation of syntactic and semantic analysis outputs. A new distance measure is developed to determine the similarities between contents of documents. The measure is based on inexact matching of attributed trees. It involves the computation of all distinct similarity common sub-trees, and can be computed efficiently. It is believed that the proposed representation scheme along with the proposed similarity measure will enable more effective document mining processes. <br /><br /> The proposed techniques to mine documents were implemented as vital components in a mining system. A case study of semantic document clustering is presented to demonstrate the working and the efficacy of the framework. Experimental work is reported, and its results are presented and analyzed.
|
7 |
Aspectos semânticos na representação de textos para classificação automática / Semantic aspects in the representation of texts for automatic classificationRoberta Akemi Sinoara 24 May 2018 (has links)
Dada a grande quantidade e diversidade de dados textuais sendo criados diariamente, as aplicações do processo de Mineração de Textos são inúmeras e variadas. Nesse processo, a qualidade da solução final depende, em parte, do modelo de representação de textos adotado. Por se tratar de textos em língua natural, relações sintáticas e semânticas influenciam o seu significado. No entanto, modelos tradicionais de representação de textos se limitam às palavras, não sendo possível diferenciar documentos que possuem o mesmo vocabulário, mas que apresentam visões diferentes sobre um mesmo assunto. Nesse contexto, este trabalho foi motivado pela diversidade das aplicações da tarefa de classificação automática de textos, pelo potencial das representações no modelo espaço-vetorial e pela lacuna referente ao tratamento da semântica inerente aos dados em língua natural. O seu desenvolvimento teve o propósito geral de avançar as pesquisas da área de Mineração de Textos em relação à incorporação de aspectos semânticos na representação de coleções de documentos. Um mapeamento sistemático da literatura da área foi realizado e os problemas de classificação foram categorizados em relação à complexidade semântica envolvida. Aspectos semânticos foram abordados com a proposta, bem como o desenvolvimento e a avaliação de sete modelos de representação de textos: (i) gBoED, modelo que incorpora a semântica obtida por meio de conhecimento do domínio; (ii) Uni-based, modelo que incorpora a semântica por meio da desambiguação lexical de sentidos e hiperônimos de conceitos; (iii) SR-based Terms e SR-based Sentences, modelos que incorporam a semântica por meio de anotações de papéis semânticos; (iv) NASARIdocs, Babel2Vec e NASARI+Babel2Vec, modelos que incorporam a semântica por meio de desambiguação lexical de sentidos e embeddings de palavras e conceitos. Representações de coleções de documentos geradas com os modelos propostos e outros da literatura foram analisadas e avaliadas na classificação automática de textos, considerando datasets de diferentes níveis de complexidade semântica. As propostas gBoED, Uni-based, SR-based Terms e SR-based Sentences apresentam atributos mais expressivos e possibilitam uma melhor interpretação da representação dos documentos. Já as propostas NASARIdocs, Babel2Vec e NASARI+Babel2Vec incorporam, de maneira latente, a semântica obtida de embeddings geradas a partir de uma grande quantidade de documentos externos. Essa propriedade tem um impacto positivo na performance de classificação. / Text Mining applications are numerous and varied since a huge amount of textual data are created daily. The quality of the final solution of a Text Mining process depends, among other factors, on the adopted text representation model. Despite the fact that syntactic and semantic relations influence natural language meaning, traditional text representation models are limited to words. The use of such models does not allow the differentiation of documents that use the same vocabulary but present different ideas about the same subject. The motivation of this work relies on the diversity of text classification applications, the potential of vector space model representations and the challenge of dealing with text semantics. Having the general purpose of advance the field of semantic representation of documents, we first conducted a systematic mapping study of semantics-concerned Text Mining studies and we categorized classification problems according to their semantic complexity. Then, we approached semantic aspects of texts through the proposal, analysis, and evaluation of seven text representation models: (i) gBoED, which incorporates text semantics by the use of domain expressions; (ii) Uni-based, which takes advantage of word sense disambiguation and hypernym relations; (iii) SR-based Terms and SR-based Sentences, which make use of semantic role labels; (iv) NASARIdocs, Babel2Vec and NASARI+Babel2Vec, which take advantage of word sense disambiguation and embeddings of words and senses.We analyzed the expressiveness and interpretability of the proposed text representation models and evaluated their classification performance against different literature models. While the proposed models gBoED, Uni-based, SR-based Terms and SR-based Sentences have improved expressiveness, the proposals NASARIdocs, Babel2Vec and NASARI+Babel2Vec are latently enriched by the embeddings semantics, obtained from the large training corpus. This property has a positive impact on text classification performance.
|
8 |
The Functions of Textbooks : A Textbook Analysis of Text Genres and their RepresentationMagnusson, Jennie January 2021 (has links)
The aim of this study is to investigate which text genres are included in three textbooks intended for the course English 5, and how they are represented. This is done in order to discuss the functions that the textbooks Blueprint A, Solid Gold and Engelska 5 & 6: Outlooks on, can have in the English classroom. The investigation is conducted through a textbook analysis that includes both a quantitative content analysis to create an overview of the text genres, and a close reading text analysis that investigates the representation of the texts with the help of Bloom’s Taxonomy. The investigation finds that all three books contain several different text genres and that they use some of the text genres more frequently than the other books do. This means that the three books have different texts in focus. Since the study also shows that the texts are represented in many different ways, the study concludes that the books can have many different functions, specific for each book and therefore it can be important to evaluate textbooks before using them and to use more than one. However, one major factor that all these three books can help teachers with is to introduce pupils to many different text genres and improve learning by including most of the perspectives introduced by Bloom as important levels of learning.
|
9 |
Analýza recenzí výrobků / Analysis of Product ReviewsKlocok, Andrej January 2020 (has links)
Online store customers generate vast amounts of product and service information through reviews, which are an important source of feedback. This thesis deals with the creation of a system for the analysis of product and shop reviews in the czech language. It describes the current methods of sentiment analysis and builds on current solutions. The resulting system implements automatic data download and their indexing, subsequently sentiment analysis together with text summary in the form of clustering of similar sentences based on vector representation of the text. A graphical user interface in the form of a web page is also included. A review data set with a total of more than six million reviews was created during the semester along with an interface for easy data export.
|
10 |
Entity-Centric Discourse Analysis and Its Applications / エンティティに注目した談話解析とその応用Wang, Xun 24 November 2017 (has links)
京都大学 / 0048 / 新制・課程博士 / 博士(情報学) / 甲第20777号 / 情博第657号 / 新制||情報||113(附属図書館) / 京都大学大学院情報学研究科知能情報学専攻 / (主査)教授 黒橋 禎夫, 教授 河原 達也, 教授 石田 亨 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM
|
Page generated in 0.0299 seconds