21 |
Short Text Classification in Twitter to Improve Information Filtering. Sriram, Bharath, 03 September 2010
No description available.
|
22 |
Improving Text Classification Using Graph-based Methods. Karajeh, Ola Abdel-Raheem Mohammed, 05 June 2024
Text classification is a fundamental natural language processing task. However, in real-world applications, class distributions are usually skewed, e.g., due to inherent class imbalance. In addition, the task difficulty changes based on the underlying language. When rich morphological structure and high ambiguity are exhibited, natural language understanding can become challenging. For example, Arabic, ranked the fifth most widely used language, has a rich morphological structure and high ambiguity that result from Arabic orthography. Thus, Arabic natural language processing is challenging. Several studies employ Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), but Graph Convolutional Networks (GCNs) have not yet been investigated for the task. Sequence-based models can successfully capture semantics in local consecutive text sequences. On the other hand, graph-based models can preserve global co-occurrences that capture non-consecutive and long-distance semantics. A text representation approach that combines local and global information can enhance performance in practical class-imbalanced text classification scenarios. Yet, multi-view graph-based text representations have received limited attention.
In this research, we first introduce the Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN), a transductive multi-view text classification model that captures textual graph representations for the minority class alongside sequence-based text representations. Experimental results show that MMCT-GCN obtains consistent improvements over baselines. Second, we develop an Arabic Bidirectional Encoder Representations from Transformers (BERT) Graph Convolutional Network (AraBERT-GCN), a hybrid model that combines large-scale pre-trained models, which encode local context and semantics, with graph-based features capable of extracting global word co-occurrences in non-consecutive, extended semantics within only one or two hops. Experimental results show that AraBERT-GCN outperforms the state-of-the-art (SOTA) on our Arabic text datasets. Finally, we propose an Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) designed for text classification that encapsulates richer, context-aware representations of word and phrase relationships, thus mitigating the impact of the complexity and ambiguity of the Arabic language. / Doctor of Philosophy / The text classification task is an important step in understanding natural language. However, this task has many challenges, such as uneven data distributions and language difficulty. For example, Arabic is the fifth most spoken language. It has many different word forms and meanings, which can make understanding harder. Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) are widely utilized for text classification. However, another kind of network, the graph convolutional network (GCN), has yet to be explored for this task. Graph-based models keep track of how words are connected, even if they are not right next to each other in a sentence. This helps with better understanding the meaning of words. On the other hand, sequence-based models do well in understanding the meaning of words that are right next to each other. Mixing both types of information in text understanding can work better, especially when dealing with unevenly distributed data. In this research, we introduce a new text classification method called the Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN). This model looks at text from different angles and combines information from graphs and sequence-based models. Our experiments show that this model performs better than others proposed in the literature. Additionally, we propose an Arabic BERT Graph Convolutional Network (AraBERT-GCN). It combines pre-trained models that understand words in context and graph features that look at how words relate to each other globally. This helps AraBERT-GCN do better than other models when working with Arabic text. Finally, we develop a special network called the Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) for Arabic text. It is designed to better understand Arabic and classify text more accurately. We do this by adding special edge features with multiple dimensions to help the network learn the relationships between words and phrases.
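To make the graph side of such models concrete, the sketch below builds the kind of global PMI-weighted word co-occurrence graph that TextGCN-style models consume; the toy corpus, window size, and positive-PMI threshold are illustrative assumptions, not the dissertation's exact construction.

```python
# Sketch: build a global word co-occurrence graph with PMI-weighted edges,
# the graph-based representation family discussed above. Toy corpus and
# window size are hypothetical; real models add document-word edges and
# feed the resulting graph to a GCN.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "graph models capture global word relations",
    "sequence models capture local word order",
]
window_size = 3

# Slide a fixed-size window over each document to collect co-occurrences.
windows = []
for doc in corpus:
    tokens = doc.split()
    for i in range(max(1, len(tokens) - window_size + 1)):
        windows.append(set(tokens[i:i + window_size]))

word_count = Counter(w for win in windows for w in win)
pair_count = Counter(
    tuple(sorted(p)) for win in windows for p in combinations(win, 2)
)

# PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), probabilities over windows.
n = len(windows)
edges = {}
for (a, b), c in pair_count.items():
    pmi = math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
    if pmi > 0:  # keep only positively associated word pairs
        edges[(a, b)] = pmi

for (a, b), w in sorted(edges.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{a} -- {b}: PMI = {w:.2f}")
```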
|
23 |
Beyond Supervised Learning: Applications and Implications of Zero-shot Text Classification. Borst-Graetz, Janos, 25 October 2024
This dissertation explores the application of zero-shot text classification, a technique for categorizing texts without annotated data in the target domain.
A true zero-shot setting breaks with the conventions of the traditional supervised machine learning paradigm that relies on
quantitative in-domain evaluation for optimization, performance measurement, and model selection.
The dissertation summarizes existing research to build a theoretical foundation for zero-shot methods, emphasizing efficiency and transparency.
It benchmarks selected approaches across various tasks and datasets to understand their general performance, strengths, and weaknesses, mirroring the model selection process.
On this foundation, two case studies demonstrate the application of zero-shot text classification:
The first engages with historical German stock market reports, utilizing zero-shot methods for aspect-based sentiment classification.
The case study reveals that although there are qualitative differences between fine-tuned and zero-shot approaches,
the aggregated results are not easily distinguishable, sparking a discussion about the practical implications.
The second case study integrates zero-shot text classification into a civil engineering document management system,
showcasing how the flexibility of zero-shot models and the omission of the training process can benefit the development of prototype software,
at the cost of unknown performance.
These findings indicate that, although zero-shot text classification works in these exemplary cases, the results are not generalizable.
Taking up the findings of these case studies, the dissertation discusses dilemmas and theoretical considerations that arise when
in-domain evaluation is omitted in applying zero-shot text classification.
It concludes by advocating a broader focus beyond traditional quantitative metrics in order to build trust in zero-shot text classification,
highlighting its practical utility as well as the necessity for further exploration as these technologies evolve.
1 Introduction
1.1 Problem Context
1.2 Related Work
1.3 Research Questions & Contribution
1.4 Author’s Publications
1.5 Structure of This Work
2 Research Context
2.1 The Current State of Text Classification
2.2 Efficiency
2.3 Approaches to Addressing Data Scarcity in Machine Learning
2.4 Challenges of Recent Developments
2.5 Model Sizes and Hardware Resources
2.6 Conclusion
3 Zero-shot Text Classification
3.1 Text Classification
3.2 State-of-the-Art in Text Classification
3.3 Neural Network Approaches to Data-Efficient Text Classification
3.4 Zero-shot Text Classification
3.5 Application
3.6 Requirements for Zero-shot Models
3.7 Approaches to Transfer Zero-shot
3.7.1 Terminology
3.7.2 Similarity-based and Siamese Networks
3.7.3 Language Model Token Predictions
3.7.4 Sentence Pair Classification
3.7.5 Instruction-following Models or Dialog-based Systems
3.8 Class Name Encoding in Text Classification
3.9 Approach Selection
3.10 Conclusion
4 Model Performance Survey
4.1 Experiments
4.1.1 Datasets
4.1.2 Model Selection
4.1.3 Hypothesis Templates
4.2 Zero-shot Model Evaluation
4.3 Dataset Complexity
4.4 Conclusion
5 Case Study: Historic German Stock Market Reports
5.1 Project
5.2 Motivation
5.3 Related Work
5.4 The Corpus and Dataset - Berliner Börsenzeitung
5.4.1 Corpus
5.4.2 Sentiment Aspects
5.4.3 Annotations
5.5 Methodology
5.5.1 Evaluation Approach
5.5.2 Trained Pipeline
5.5.3 Zero-shot Pipeline
5.5.4 Dictionary Pipeline
5.5.5 Tradeoffs
5.5.6 Label Space Definitions
5.6 Evaluation - Comparison of the Pipelines on BBZ
5.6.1 Sentence-based Sentiment
5.6.2 Aspect-based Sentiment
5.6.3 Qualitative Evaluation
5.7 Discussion and Conclusion
6 Case Study: Document Management in Civil Engineering
6.1 Project
6.2 Motivation
6.3 Related Work
6.4 The Corpus and Knowledge Graph
6.4.1 Data
6.4.2 BauGraph – The Knowledge Graph
6.5 Methodology
6.5.1 Document Insertion Pipeline
6.5.2 Frontend Integration
6.6 Discussion and Conclusion
7 MLMC
7.1 How it works
7.2 Motivation
7.3 Extensions of the Framework
7.4 Other Projects
7.4.1 Product Classification
7.4.2 Democracy Monitor
7.4.3 Climate Change Adaptation Finance
7.5 Conclusion
8 Discussion: The Five Dilemmas of Zero-shot
8.1 On Evaluation
8.2 The Five Dilemmas of Zero-shot
8.2.1 Dilemma of Evaluation or Are You Working at All?
8.2.2 Dilemma of Comparison or How Do I Get the Best Model?
8.2.3 Dilemma of Annotation and Label Definition or Are We Talking about the Same Thing?
8.2.4 Dilemma of Interpretation or Am I Biased?
8.2.5 Dilemma of Unsupervised Text Classification or Do I Have to Trust You?
8.3 Trust in Zero-shot Capabilities
8.4 Conclusion
9 Conclusion
9.1 Summary
9.1.1 RQ1: Strengths and Weaknesses
9.1.2 RQ2: Application Studies
9.1.3 RQ3: Implications
9.2 Final Thoughts & Future Directions
References
A Appendix for Survey Chapter
A.1 Model Selection
A.2 Task-specific Hypothesis Templates
A.3 Fractions of SotA
B Uncertainty vs. Accuracy
C Declaration of Authorship
D Declaration: Use of AI-Tools
E Bibliographic Data
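As a concrete illustration of the sentence-pair (NLI) approach and the hypothesis templates surveyed in Chapters 3 and 4, the sketch below runs an off-the-shelf zero-shot classifier; the model name, labels, and template are illustrative assumptions, not the dissertation's exact configuration.

```python
# Sketch: zero-shot classification via an NLI sentence-pair model. The text
# echoes the stock-market case study; model, labels, and template are
# hypothetical choices, not the evaluated setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "Grain prices rallied strongly on the exchange this morning.",
    candidate_labels=["positive", "negative", "neutral"],
    hypothesis_template="The sentiment of this report is {}.",
)
print(result["labels"][0], round(result["scores"][0], 3))
```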
|
24 |
Multi Domain Semantic Information Retrieval Based on Topic Model. Lee, Sanghoon, 07 May 2016
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amounts of information accumulate on the Web. This information explosion increases the need for new tools that retrieve meaningful knowledge from various complex information sources. Thus, techniques for searching and extracting important information from numerous database sources have been a key challenge for current IR systems.
Topic modeling is one of the most recent techniques that discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model; it generates topics from large corpora of resources such as text, images, and audio. It has been widely used in information retrieval and data mining, providing an efficient way of identifying latent topics among document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when infrequently occurring words are estimated. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics on a purely statistical basis. As a result, LDA can cause either a reduction in the quality of topic words or looser relations between topics.
In order to solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model is that it narrows semantic concepts from broad domain knowledge to a specific domain, which addresses the unknown-domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
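For orientation, here is a minimal sketch of the plain LDA baseline that the proposed domain-specific model extends, using gensim; the toy corpus and topic count are illustrative assumptions.

```python
# Sketch: fit a plain LDA topic model with gensim. The domain-specific
# extension described above would additionally inject domain concepts;
# only the baseline is shown, on a hypothetical toy corpus.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["gene", "expression", "protein", "sequence"],
    ["query", "expansion", "retrieval", "document"],
    ["protein", "structure", "sequence", "alignment"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
               random_state=0, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4):
    print(topic_id, words)
```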
|
25 |
High performance latent Dirichlet allocation for text mining. Liu, Zelong, January 2013
Latent Dirichlet Allocation (LDA) is a generative probabilistic model, structured as a three-level hierarchical Bayesian model. LDA computes the latent topic structure of the data and extracts significant information from documents. However, traditional LDA has several limitations in practical applications. Because it is an unsupervised learning model, LDA cannot be used directly for classification; it must be embedded into an appropriate classification algorithm. As a generative model, LDA may also generate latent topics in categories to which the target documents do not belong, introducing computational deviations and reducing classification accuracy. The number of topics in LDA greatly influences the learning of model parameters. Noise samples in the training data also affect the final text classification result, and the quality of LDA-based classifiers depends heavily on the quality of the training samples. Finally, although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method which combines the LDA model and the Support Vector Machine (SVM) classification algorithm, improving classification accuracy while reducing the dimensionality of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise reduction scheme for processing noisy data: even when the noise ratio in the training data set is large, the scheme still produces a high level of classification accuracy. Finally, the thesis parallelizes LDA using the MapReduce model, the de facto standard for data-intensive computing. A genetic-algorithm-based load balancing scheme is designed to balance workloads among computers in a heterogeneous MapReduce cluster whose nodes vary in CPU speed, memory space, and hard disk space.
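A minimal sketch of the LDA-plus-SVM combination described above, in scikit-learn terms: documents are mapped to topic distributions, which serve as low-dimensional features for the SVM. The dataset, topic count, and settings are illustrative assumptions; the thesis's DBSCAN-based topic selection, noise reduction, and MapReduce parallelization are not shown.

```python
# Sketch: reduce dimensionality with LDA, then classify the topic vectors
# with an SVM. Dataset and hyperparameters are illustrative stand-ins.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train",
                           categories=["sci.med", "sci.space"])

clf = make_pipeline(
    CountVectorizer(max_features=5000, stop_words="english"),
    LatentDirichletAllocation(n_components=20, random_state=0),
    LinearSVC(),  # classifies in the 20-dimensional topic space
)
clf.fit(train.data, train.target)
print(clf.predict(["the probe entered orbit around the planet"]))
```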
|
26 |
Word based off-line handwritten Arabic classification and recognition: design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches. AlKhateeb, Jawad Hasan Yasin, January 2010
The design of a machine which reads unconstrained words remains an unsolved problem; automatic interpretation of handwritten documents by a computer is still under research. Most systems attempt to segment words into letters and read words one character at a time. However, segmenting handwritten words is very difficult, so words are instead treated as a whole. This research investigates a number of features computed from whole words for the recognition of handwritten words in particular. Arabic text classification and recognition is a complicated process compared to Latin and Chinese text recognition, owing to the cursive nature of Arabic text. The work presented in this thesis proposes word-based recognition of handwritten Arabic scripts and is divided into three main stages. The first stage is pre-processing, which applies methods essential for the automatic recognition of handwritten documents. In this stage, techniques for detecting the baseline and segmenting words in handwritten Arabic text are presented: connected components are extracted, distances between different components are analyzed, and the statistical distribution of these distances is then used to determine an optimal threshold for word segmentation (see the sketch below). The second stage is feature extraction, which makes use of the normalized images to extract features essential to recognizing the images; various methods of feature extraction are implemented and examined. The third and final stage is classification, using classifiers such as the k-nearest neighbour classifier (k-NN), neural network classifiers (NN), hidden Markov models (HMMs), and the dynamic Bayesian network (DBN). To test this concept, the particular pattern recognition problem studied is the classification of 32,492 words from the IFN/ENIT database. The results were promising and very encouraging in terms of improved baseline detection and word segmentation for further recognition. Moreover, several feature subsets were examined, and a best recognition performance of 81.5% was achieved.
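A minimal sketch of the gap-threshold idea for word segmentation, assuming the distances between connected components fall into two modes (intra-word and inter-word); the gap values and the 1-D k-means split are hypothetical stand-ins for the statistical fit used in the thesis.

```python
# Sketch: derive a word-segmentation threshold from the distribution of
# horizontal gaps between connected components. Gap values are hypothetical;
# a 2-cluster 1-D k-means separates intra-word from inter-word gaps.
import numpy as np
from sklearn.cluster import KMeans

gaps = np.array([2, 3, 2, 4, 15, 3, 18, 2, 16, 3, 4, 17]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(gaps)
lo, hi = sorted(c[0] for c in km.cluster_centers_)
threshold = (lo + hi) / 2  # midpoint between the two gap modes

is_word_boundary = gaps.ravel() > threshold
print(f"threshold = {threshold:.1f}; boundaries at gap indices:",
      np.nonzero(is_word_boundary)[0].tolist())
```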
|
27 |
Role of semantic indexing for text classification. Sani, Sadiq, January 2014
The Vector Space Model (VSM) of text representation suffers from a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption, where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, failure to take into account the semantic relatedness between terms means that document similarity is not properly captured in the VSM. To address this problem, semantic indexing approaches have been proposed for modelling the semantic relatedness between terms in document representations. Accordingly, in this thesis, we empirically review the impact of semantic indexing on text classification. This empirical review allows us to answer one important question: how beneficial is semantic indexing to text classification performance? We also carry out a detailed analysis of the semantic indexing process, which allows us to identify reasons why semantic indexing may lead to poor text classification performance. Based on our findings, we propose a semantic indexing framework called Relevance Weighted Semantic Indexing (RWSI) that addresses the limitations identified in our analysis. RWSI uses relevance weights of terms to improve the semantic indexing of documents. A second problem with the VSM is the lack of supervision in the process of creating document representations. This arises from the fact that the VSM was originally designed for unsupervised document retrieval. An important feature of effective document representations is the ability to discriminate between relevant and non-relevant documents. For text classification, relevance information is explicitly available in the form of document class labels. Thus, more effective document vectors can be derived in a supervised manner by taking advantage of available class knowledge. Accordingly, we investigate approaches for utilising class knowledge for supervised indexing of documents. Firstly, we demonstrate how the RWSI framework can be utilised for assigning supervised weights to terms for supervised document indexing. Secondly, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. A further limitation of the standard VSM is that an indexing vocabulary consisting only of terms from the document collection is used for document representation. This is based on the assumption that terms alone are sufficient to model the meaning of text documents. However, for certain classification tasks, terms are insufficient to adequately model the semantics needed for accurate document classification. A solution is to index documents using semantically rich concepts. Accordingly, we present an event extraction framework called Rule-Based Event Extractor (RUBEE) for identifying and utilising event information for concept-based indexing of incident reports. We also demonstrate how certain attributes of these events, e.g. negation, can be taken into consideration to distinguish between documents that describe the occurrence of an event and those that mention the non-occurrence of that event.
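A minimal sketch of the core semantic-indexing operation: spreading each document's term weights over semantically related terms through a term-term similarity matrix, so that lexically different but related documents overlap. The vocabulary and similarity values are illustrative assumptions, not RWSI itself.

```python
# Sketch: semantic indexing as smoothing of BOW vectors with a term-term
# similarity matrix S. Under plain BOW, "car" and "automobile" documents
# have zero overlap; after smoothing they do not. Values are hypothetical.
import numpy as np

vocab = ["car", "automobile", "flower"]
S = np.array([            # S[i, j]: assumed relatedness of term i to term j
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

doc_a = np.array([1.0, 0.0, 0.0])  # mentions only "car"
doc_b = np.array([0.0, 1.0, 0.0])  # mentions only "automobile"

print("BOW overlap:     ", doc_a @ doc_b)              # 0.0
print("semantic overlap:", (doc_a @ S) @ (doc_b @ S))  # > 0
```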
|
28 |
A Study on Text Classification Methods and Text Features. Danielsson, Benjamin, January 2019
When it comes to the task of classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis investigates the performance of multiple classifiers depending on the features used, the type of classes to classify, and the optimization of said classifiers. The classifiers of interest are support vector machines (SMO) and multilayer perceptrons (MLP); the features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) on the complexity measures. The features are created based on the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset, the classifiers attempted to classify texts into nine different text categories, while for the DigInclude dataset the sentences were classified as either standard or simplified. The classification tasks on the DigInclude dataset showed poor performance in all trials. The SUC dataset showed the best performance when using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures with and without PCA showed that performance was largely unchanged, although not using PCA performed slightly better.
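A minimal sketch of the comparison setup, with scikit-learn standing in for the original toolchain; the random matrix is a placeholder for the SUC-derived complexity measures, so the printed scores are meaningful only as a template.

```python
# Sketch: compare an SVM and an MLP on complexity-measure features, with
# and without PCA. Features are random placeholders, not real SUC data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))    # 30 complexity measures per text
y = rng.integers(0, 2, size=200)  # standard vs. simplified

for name, model in [
    ("SVM", make_pipeline(StandardScaler(), SVC(kernel="linear"))),
    ("SVM+PCA", make_pipeline(StandardScaler(), PCA(n_components=10),
                              SVC(kernel="linear"))),
    ("MLP", make_pipeline(StandardScaler(), MLPClassifier(max_iter=500))),
]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(2))
```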
|
29 |
Classificação de textos com redes complexas / Using complex networks to classify texts. Amancio, Diego Raphael, 29 October 2013
The automatic classification of texts in pre-established categories is drawing increasing interest owing to the need to organize the ever-growing number of electronic documents. The prevailing approach for classification is based on analysis of textual contents. In this thesis, we investigate the applicability of attributes based on textual style using the complex network (CN) representation, where nodes represent words and edges are adjacency relations. We studied the suitability of CN measurements for natural language processing tasks, with classification being assisted by supervised and unsupervised machine learning methods. A detailed study of topological measurements in texts revealed that several measurements are informative, in the sense that they are able to distinguish meaningful from shuffled texts. Moreover, most measurements depend on syntactic factors, while intermittency measurements are more sensitive to semantic factors. As for the use of the CN model in practical scenarios, there is significant correlation between authors' style and network topology.
We achieved an accuracy rate of 65% in discriminating eight authors of novels with the use of network and intermittency measurements. During the stylistic analysis, we also found that books belonging to the same literary movement could be identified from their similar topological features. The network model also proved useful for disambiguating word senses. Upon employing only topological information to characterize nodes representing polysemous words, we found a strong relationship between syntax and semantics. For several words, the CN approach performed surprisingly better than the method based on recurrence patterns of neighboring words. The studies carried out in this thesis confirm that stylistic and semantic aspects play a crucial role in the structural organization of word adjacency networks. The word adjacency model investigated here might be useful not only to provide insight into the underlying mechanisms of the language, but also to enhance the performance of real applications implementing both CN and traditional approaches.
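A minimal sketch of the word-adjacency network model, using networkx; the sentence is illustrative, and a real analysis would compute many more topological measurements over entire books.

```python
# Sketch: model a text as a word-adjacency network and read off topological
# measurements of the kind used for stylistic classification. Toy sentence.
import networkx as nx

text = "complex networks model texts where vertices represent words"
tokens = text.split()

G = nx.Graph()
G.add_edges_from(zip(tokens, tokens[1:]))  # adjacent words become edges

print("degrees:", dict(G.degree()))
print("average clustering:", nx.average_clustering(G))
print("average shortest path:", nx.average_shortest_path_length(G))
```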
|
30 |
Busca guiada de patentes de Bioinformática / Guided Search of Bioinformatics Patents. Dutra, Marcio Branquinho, 17 October 2013
Patents are temporary public licenses granted by the State that guarantee inventors and assignees the economic exploitation of their inventions. Trademark and patent offices recommend performing wide searches in different databases, using classic patent search systems and other specific search tools, before filing a patent application, in order to certify that the invention to be deposited has not already been published, whether in its original field or in other fields. Research has shown that using classification information in patent searches improves the efficiency of query results. The objective of the research reported here is to explore linguistic artifacts, Information Retrieval techniques, and Text Classification techniques to guide the search for Bioinformatics patents. The result of this investigation is the Bioinformatics Patent Search System (BPS), which uses an automatic classifier to guide searches for Bioinformatics patents. The use of BPS is demonstrated in comparisons with current patent search tools on a specific collection of Bioinformatics patents. In the future, BPS should be tested on different and more robust collections.
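A minimal sketch of classifier-guided search: an automatic classifier scores candidate patents for the Bioinformatics class, and the search ranks results by that score. The training texts, candidates, and the TF-IDF-plus-logistic-regression choice are illustrative assumptions, not BPS's actual components.

```python
# Sketch: guide a patent search with an automatic classifier -- candidates
# are ranked by their predicted probability of being Bioinformatics
# patents. All texts and the model choice are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "sequence alignment method for genomic data",        # bioinformatics
    "protein structure prediction via dynamic programming",
    "combustion engine valve timing mechanism",          # other domains
    "suspension system for motor vehicles",
]
labels = [1, 1, 0, 0]

guide = make_pipeline(TfidfVectorizer(), LogisticRegression())
guide.fit(train_docs, labels)

candidates = ["gene expression classifier apparatus",
              "hydraulic brake assembly"]
scores = guide.predict_proba(candidates)[:, 1]
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc}")
```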
|