11 |
Um descritor de imagens baseado em particionamento extremo para busca em bases grandes e heterogêneas / An image descriptor based on extreme partitioning for search in large and heterogeneous databases. Vidal, Márcio Luiz Assis, 25 October 2013
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / In this thesis we propose a new image descriptor that addresses the problem of image search in large and heterogeneous databases. The approach uses extreme partitioning to obtain visual properties of the image, which are converted into a textual description; once that description is generated, traditional text-based information retrieval techniques can be applied. The key point of the proposal is the textual representation of the visual properties of an image's partitions, which makes the technique highly scalable, since efficient text-based search techniques exist for databases on the order of millions of documents. Experiments carried out to confirm the viability of the proposal showed that the technique reaches higher precision than other content-based image retrieval techniques on a database of more than 100,000 images.
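The record describes the technique only in prose; the sketch below illustrates the general idea under stated assumptions, using NumPy and scikit-learn. The block-partitioning rule, quantization levels, and token format are illustrative stand-ins, not the descriptor actually defined in the thesis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def partition_tokens(image: np.ndarray, block: int = 8, levels: int = 4) -> list[str]:
    """Partition an RGB image into fixed blocks and emit one 'visual word' per block."""
    h, w = image.shape[:2]
    step = 256 // levels                        # quantization step per colour channel
    tokens = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            mean = image[y:y + block, x:x + block].reshape(-1, 3).mean(axis=0)
            r, g, b = (int(c) // step for c in mean)
            tokens.append(f"v{r}_{g}_{b}")      # e.g. "v2_0_3"
    return tokens

# Once every image is a "document" of visual words, a standard text index applies:
# docs = [" ".join(partition_tokens(img)) for img in image_database]
# index = TfidfVectorizer().fit_transform(docs)
```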
|
12 |
Representações textuais e a geração de hubs: um estudo comparativo / Textual representations and hub formation: a comparative study. Aguiar, Raul Freire, January 2017
Advisor: Prof. Dr. Ronaldo Pratti / Master's dissertation, Universidade Federal do ABC, Graduate Program in Computer Science, 2017. / The hubness phenomenon, associated with the curse of dimensionality, has been studied from different perspectives in recent years. These studies point out that the problem is present in several real-world data sets and that the presence of hubs (the tendency of some examples to appear frequently in the nearest-neighbor lists of other examples) brings a series of undesirable consequences, such as an increase in misclassification error. In text mining, the problem also depends on the representation chosen for the documents. The main objective of this dissertation is therefore to evaluate the impact of hub formation across different textual representations; to the best of our knowledge, no in-depth study of the effects of hubness across textual representations was available in the literature during the period of this research. The results suggest that different textual representations yield corpora with different propensities for hub formation. It was also noticed that the incidence of hubs in the different representations has a similar influence on some classifiers. We also analyzed classifier performance after removing documents flagged as hubs, in pre-established proportions of the total data set size; this removal brought a trend of performance improvement to some algorithms. Thus, although not always effective, the strategy of identifying and removing hubs with a majority of bad neighbors can be an interesting preprocessing technique for improving the predictive performance of text classification.
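As a concrete reference point, the sketch below shows one common way to quantify hubness (the k-occurrence count N_k) and to flag "bad" hubs whose neighborhoods mostly disagree with their own label, assuming scikit-learn. The hub threshold and disagreement cutoff are illustrative; the dissertation's exact criterion and removal proportions are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrence(X: np.ndarray, k: int = 10) -> np.ndarray:
    """N_k(x): how often each point appears in the k-NN lists of other points."""
    neigh = (NearestNeighbors(n_neighbors=k + 1).fit(X)
             .kneighbors(X, return_distance=False)[:, 1:])  # column 0 is the point itself
    counts = np.zeros(len(X), dtype=int)
    for row in neigh:
        counts[row] += 1
    return counts

def bad_hub_mask(X: np.ndarray, y, k: int = 10, hub_factor: float = 2.0):
    """Flag hubs whose neighbourhoods mostly carry a different label than their own."""
    y = np.asarray(y)
    counts = k_occurrence(X, k)
    hubs = counts > hub_factor * counts.mean()   # simplified hub criterion
    neigh = (NearestNeighbors(n_neighbors=k + 1).fit(X)
             .kneighbors(X, return_distance=False)[:, 1:])
    disagree = (y[neigh] != y[:, None]).mean(axis=1)
    return hubs & (disagree > 0.5)
```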
|
13 |
A Study on Effective Approaches for Exploiting Temporal Information in News Archives / ニュースアーカイブの時制情報活用のための有効な手法に関する研究. Wang, Jiexin, 26 September 2022
Kyoto University / New-system doctoral program / Doctor of Informatics (Kō No. 24259, Jōhaku No. 803, 新制||情||135, Main Library) / Department of Social Informatics, Graduate School of Informatics, Kyoto University / Examining committee: Prof. Masatoshi Yoshikawa (chair), Prof. Keishi Tajima, Prof. Sadao Kurohashi, Program-Specific Assoc. Prof. LIN Donghui / Qualified under Article 4, Paragraph 1 of the Degree Regulations / DFAM
|
14 |
Um método para extração de palavras-chave de documentos representados em grafos / A method for keyword extraction from documents represented as graphs. Abilhoa, Willyan Daniel, 05 February 2014
Fundação de Amparo à Pesquisa do Estado de São Paulo / Twitter is a microblog service that generates a huge amount of textual content daily. All this content needs to be explored by means of techniques such as text mining, natural language processing, and information retrieval. In this context, automatic keyword extraction is a task of great usefulness that can be applied to indexing, summarization, and knowledge extraction from texts. A fundamental step in text mining is building a text representation model. The vector space model (VSM) is the best known and most widely used of these techniques; however, difficulties and limitations of the VSM, such as scalability and sparsity, motivate alternative approaches. This dissertation proposes a keyword extraction method for tweet collections, called TKG (Twitter Keyword Graph), which represents texts as graphs and applies centrality measures to find the relevant vertices (keywords). To assess the performance of the proposed approach, two different sets of experiments were performed, with comparisons against TF-IDF and KEA using human classifications as benchmarks. The experiments showed that some variations of TKG are invariably superior to others and to the algorithms used for comparison.
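A rough sketch of the graph-and-centrality idea behind TKG follows, assuming networkx. The tokenization, edge weighting, and choice of centrality measure are simplified stand-ins for the variants the dissertation evaluates.

```python
import itertools
import networkx as nx

def tkg_keywords(tweets: list[str], top_k: int = 10) -> list[str]:
    """Build a word co-occurrence graph over tweets and rank words by centrality."""
    g = nx.Graph()
    for tweet in tweets:
        words = set(tweet.lower().split())
        # Connect every pair of words that co-occur in the same tweet.
        for a, b in itertools.combinations(sorted(words), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    rank = nx.degree_centrality(g)   # closeness or eccentricity are other TKG variants
    return sorted(rank, key=rank.get, reverse=True)[:top_k]
```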
|
15 |
Evaluation of the performance of machine learning techniques for email classification / Utvärdering av prestationen av maskininlärningstekniker för e-post klassificering. Tapper, Isabella, January 2022
Manual categorization of a mail inbox can often become time-consuming, so many attempts have been made to use machine learning for this task. One essential Natural Language Processing (NLP) task is text classification, which is a big challenge since an NLP engine is not a native speaker of any human language and often fails to understand sarcasm and underlying intent. One of the NLP challenges is representing text: embeddings can be learned, or they can be generated by a pre-trained model. The pre-trained Sentence Bidirectional Encoder Representations from Transformers (SBERT) model, built on BERT, is state-of-the-art for generating vector representations of longer text. In this project, different methods of classifying and clustering emails were studied. The performance of three supervised classification models was compared: a Support Vector Machine (SVM) and a Neural Network (NN) were trained on SBERT embeddings, while the third model, a Recurrent Neural Network (RNN), was trained on raw data. The motivation for this experiment was to see whether SBERT embeddings are a good choice of text representation when combined with simpler classification models in an email classification task. The results show that the SVM and NN outperform the RNN. Since most real data is unlabeled, this thesis also evaluated how well unsupervised methods perform in email clustering, using SBERT embeddings as text representations and the available labels for evaluation. Three unsupervised clustering models are reviewed: K-Means (KM), Spectral Clustering (SC), and Hierarchical Agglomerative Clustering (HAC). All three had similar performance in terms of precision, recall, and F1-score, evaluated against the labeled dataset. In conclusion, this thesis gives evidence that, in an email classification task, supervised models do better training on pre-trained SBERT embeddings than on raw data, and it shows that the output of the clustering methods is on par with that of the selected supervised learning techniques.
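A minimal sketch of the strongest supervised pipeline described above (SBERT embeddings feeding an SVM), assuming the sentence-transformers and scikit-learn packages. The model name and kernel are placeholder choices, not necessarily those used in the thesis.

```python
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT-style model

def train_classifier(emails: list[str], labels: list[str]) -> SVC:
    """Fit a linear SVM on fixed-size SBERT sentence embeddings."""
    X = encoder.encode(emails)   # shape: (n_emails, embedding_dim)
    clf = SVC(kernel="linear")
    clf.fit(X, labels)
    return clf

# Usage (hypothetical data):
# clf = train_classifier(train_emails, train_labels)
# pred = clf.predict(encoder.encode(["Quarterly invoice attached ..."]))
```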
|
16 |
Natural Language Processing for Improving Search Query Results: Applied on The Swedish Armed Force's Profession Guide. Harju Schnee, Andreas, January 2023
Text has been the historical way of preserving and acquiring knowledge, and text data today is an increasingly growing part of the digital footprint, together with the need to query this data for information. Seeking information is a constant, ongoing process and a crucial part of many systems around us; the ability to perform fast and effective searches is a must when dealing with vast amounts of data. This thesis implements an information retrieval system based on the Swedish Defence Force's profession guide, with the aim of producing a system that retrieves relevant professions from user-defined queries of varying size. A number of Natural Language Processing techniques were investigated and implemented. To transform the gathered profession descriptions, a document embedding model, doc2vec, was implemented, producing document vectors that are compared to find similarities between documents. The final system was evaluated by domain experts, represented by active military personnel who quantified the relevance of the retrieved professions into a measurable performance. The system retrieved relevant information for 46.6% and 56.6% of the long and short text inputs respectively, resulting in a much more generalized and capable system than the search function available in the profession guide today.
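A hedged sketch of the retrieval core follows, assuming gensim's Doc2Vec. The preprocessing, vector size, and training epochs are illustrative, and the profession corpus is represented here as a simple name-to-description mapping rather than the thesis's actual data.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_index(professions: dict[str, str]) -> Doc2Vec:
    """Train one vector per profession description, tagged with the profession name."""
    docs = [TaggedDocument(text.lower().split(), [name])
            for name, text in professions.items()]
    return Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)

def search(model: Doc2Vec, query: str, top_n: int = 5):
    """Embed a free-text query and return the most similar professions by cosine."""
    qvec = model.infer_vector(query.lower().split())
    return model.dv.most_similar([qvec], topn=top_n)   # [(profession, similarity), ...]
```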
|
17 |
Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization. Šabatka, Ondřej, January 2010
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different ways of pre-processing and representing text. The thesis also focuses on the use of N-grams as features for document representation and describes algorithms for extracting them. The next part outlines the classification methods used. In the practical part, an application for pre-processing and for creating different representations of textual data is designed and implemented. The experiments analyse the influence of these representations on the accuracy of classification algorithms.
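The kind of comparison the thesis runs can be sketched as follows, assuming scikit-learn: the same classifier is trained under two representations (word unigrams versus character N-grams) and scored by cross-validation. The concrete representations and classifier here are illustrative choices, not the thesis's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

REPRESENTATIONS = {
    "word unigrams": TfidfVectorizer(analyzer="word"),
    "char 3-grams": TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)),
}

def compare_representations(texts: list[str], labels: list[str]) -> None:
    """Score the same classifier under each text representation."""
    for name, vectorizer in REPRESENTATIONS.items():
        pipeline = make_pipeline(vectorizer, MultinomialNB())
        score = cross_val_score(pipeline, texts, labels, cv=5).mean()
        print(f"{name}: mean accuracy {score:.3f}")
```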
|
18 |
The research on Chinese text multi-label classification / Avancée en classification multi-labels de textes en langue chinoise / 中文文本多标签分类研究. Wei, Zhihua, 07 May 2010
Text Classification (TC), an important field in information technology, has many valuable applications. Facing a sea of information resources, the objects of TC become more complicated and diverse, and the pursuit of effective and practical TC technology is fairly challenging; more and more researchers consider multi-label TC better suited to many applications. Chinese text raises specific difficulties: words may consist of one, two, or three characters, there is no typographic separation between words, and many word orders are possible within a sentence, all of which leads to hard ambiguity problems. This thesis analyses the difficulties and problems in multi-label TC and Chinese text representation, building on a large body of algorithms for single-label and multi-label TC, and, aiming at the high dimensionality of the feature space, the sparse distribution of text representations, and the poor performance of multi-label classifiers, brings forward corresponding algorithms from different angles. Focusing on the dimensionality "disaster" that arises when Chinese texts are represented with n-grams (sequences of n = 1, 2, or 3 characters, a coding particularly suited to Chinese because it is fast and requires neither dictionary-based word recognition nor prior segmentation), a two-step feature-selection algorithm is constructed, combining the filtering of rare features within classes with the selection of discriminative features across classes. Moreover, the proper value of n, the feature-weighting strategy, and the correlation among features are discussed through a variety of experiments, contributing useful conclusions to the study of n-gram representations of Chinese text. In view of a disadvantage of the Latent Dirichlet Allocation (LDA) model (arbitrarily revising the variable during smoothing), a new smoothing strategy based on Tolerance Rough Sets (TRS) is put forward: it first constructs tolerance classes over the global vocabulary and then assigns values to out-of-vocabulary (OOV) words in each class according to their tolerance class. To improve the performance of multi-label classifiers and reduce computational complexity, a TC method based on the LDA model is applied to Chinese text representation: topics are extracted statistically from the texts, which are then represented by topic vectors; it shows competitive performance on both English and Chinese corpora. To further enhance multi-label TC, a compound classification framework is proposed that partitions the text space by computing upper and lower approximations, decomposing a multi-label TC problem into several single-label TC problems and several multi-label TC problems with fewer labels than the original: an unknown text is classified by a single-label classifier when it falls into the lower approximation of some class, and by the corresponding multi-label classifier otherwise. Experiments show that this framework improves both accuracy and efficiency. An application system, TJ-MLWC (Tongji Multi-label Web Classifier), was also designed; it calls results from search engines directly and classifies them in real time using an improved Naïve Bayes classifier, letting users immediately locate the texts of interest according to the class information given by TJ-MLWC.
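Of the contributions listed above, the two-step feature selection lends itself to a short sketch, assuming scikit-learn: step one drops N-grams that are rare within every class, and step two keeps the most class-discriminative survivors. The chi-square criterion and the thresholds below are stand-ins for the thesis's exact rules.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def two_step_features(texts: list[str], labels, min_class_df: int = 3, k_best: int = 5000):
    """Step 1: drop n-grams rare in every class; step 2: keep discriminative ones."""
    # Character n-grams sidestep Chinese word segmentation entirely.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vectorizer.fit_transform(texts)
    y = np.asarray(labels)
    keep = np.zeros(X.shape[1], dtype=bool)
    for c in np.unique(y):
        class_df = np.asarray((X[y == c] > 0).sum(axis=0)).ravel()
        keep |= class_df >= min_class_df       # frequent enough within some class
    X = X[:, np.where(keep)[0]]
    # Rank the survivors by class discrimination across all classes.
    selector = SelectKBest(chi2, k=min(k_best, X.shape[1])).fit(X, y)
    return selector.transform(X), vectorizer, selector
```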
|