151 |
Classificação de sites a partir das análises estrutural e textual / Classification of websites based on structural and textual analysis
Ribas, Oeslei Taborda 28 August 2013
With the wide use of the web today and its constant growth, the automatic classification of websites has gained increasing importance, because it is often necessary to block access to specific sites, for example adult content sites in elementary and secondary schools. Different studies in the literature have proposed new methods for classifying sites, with the goal of increasing the rate of correctly categorized pages. This work aims to contribute to current classification methods by comparing four aspects of the classification process: classification algorithms, dimensionality (the number of attributes considered), attribute evaluation metrics, and the selection of textual and structural attributes present in web pages. We use the vector model to represent text and a classical machine learning approach to the classification task. Several metrics are used to select the most relevant terms, and classification algorithms from different paradigms are compared: probabilistic (Naïve Bayes), decision trees (C4.5), instance-based learning (KNN, k-nearest neighbors) and support vector machines (SVM). The experiments were performed on a dataset containing sites in two languages, Portuguese and English. The results show that a classifier with good accuracy can be obtained using only the anchor text of hyperlinks. In the experiments, the classifier based on this information achieved an F-measure of 99.59%.
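The idea of classifying a site from hyperlink anchor text alone can be illustrated with a minimal sketch. It assumes a bag-of-words (vector model) representation and a multinomial Naïve Bayes classifier with add-one smoothing; the function names, labels, and toy training data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real system would do more.
    return text.lower().split()


def train_naive_bayes(examples):
    """Train on (anchor_text, label) pairs; returns the model as a tuple."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for term in tokenize(text):
            if term in vocabulary:  # ignore terms never seen in training
                score += math.log((term_counts[label][term] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical anchor texts collected from two kinds of sites.
training = [
    ("adult videos xxx", "blocked"),
    ("xxx adult chat", "blocked"),
    ("school homework math", "allowed"),
    ("math lessons school", "allowed"),
]
model = train_naive_bayes(training)
print(predict(model, "adult chat"))
print(predict(model, "school math"))
```

The sketch covers only one of the four paradigms compared in the abstract and omits attribute selection entirely.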
|
153 |
Odysseýs : sistema para análise de documentos de patentes / Odysseýs : system for analysis of patent documents
Masago, Fábio Kenji, 1984 04 August 2013
Advisor: Jacques Wainer / Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação / Issue date: 2013
Abstract: A patent is a document, granted by the state to its authors, concerning ownership of a creation; it prevents third parties from producing, using, commercializing, importing or exporting the described invention without authorization from the document's owner. A common kind of study in economics uses patents to measure the importance or technological impact of an innovative field of an entity or nation. Patents can thus be seen as a kind of meter of inventive level, and the citations they contain as a means of measuring the flow or impact of knowledge of a country or firm, as well as of evaluating trends in a technological field. This master's dissertation presents the development of a tool to assist in patent analysis, addressing the applicability of the Latent Dirichlet Allocation (LDA) method to patent similarity. The computational system, called Odysseýs, evaluates the similarity between a patent supplied by the user and a group of documents, ordering them by their degree of similarity to the patent under evaluation. In addition, the software can generate patent citation networks in an unsupervised manner: starting from a user-specified query, it searches the United States Patent and Trademark Office (USPTO) database for a set of related patents and uses those patents for the similarity analysis and for generating the knowledge flow network. The lack of national software specifically for patent processing and the scarcity of auxiliary tools for analyzing such documents were the main motivations for the project. / Master's / Computer Science / Master in Computer Science
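Ordering documents by their similarity to a query patent can be sketched as follows. The sketch assumes each patent has already been reduced to a topic distribution (for instance by an LDA model) and uses cosine similarity as the comparison; the abstract does not say which similarity measure Odysseýs uses, so that choice, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_topics, corpus_topics):
    """Return (document id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_topics, topics))
        for doc_id, topics in corpus_topics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic distributions over three topics.
corpus_topics = {
    "patent_A": [0.7, 0.2, 0.1],
    "patent_B": [0.1, 0.1, 0.8],
    "patent_C": [0.4, 0.5, 0.1],
}
query_topics = [0.6, 0.3, 0.1]

for doc_id, score in rank_by_similarity(query_topics, corpus_topics):
    print(doc_id, round(score, 3))
```

Other measures suited to probability distributions, such as Jensen-Shannon divergence, would slot into the same ranking function.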
|
154 |
Exploration of an Automated Motivation Letter Scoring System to Emulate Human Judgement
Munnecom, Lorenna, Pacheco, Miguel Chaves de Lemos January 2020
As the popularity of the master's in data science at Dalarna University increases, so does the number of applicants. The aim of this thesis was to explore different approaches to an automated motivation letter scoring system that could emulate human judgement and automate candidate selection. Several steps, such as image processing and text processing, were required to retrieve the numerous features that could lead to identifying the factors graded by the program managers. Grammar-based features and advanced textual features were extracted from the motivation letters, followed by the application of topic modelling methods to extract the probability of each topic occurring within a motivation letter. Furthermore, correlation analysis was applied to quantify the association between the features and the different factors graded by the program managers, followed by ordinal logistic regression and random forest to build models with the most impactful variables. Finally, the Naïve Bayes algorithm, random forest and support vector machine were used, first for classification and then for prediction. The results were not promising, as the factors were not accurately identified. Nevertheless, the authors suspect that the factors may be strongly related to the emphasis on specific topics within a motivation letter, which can lead to further research.
|
155 |
Zpracování zákaznických požadavků za použití hlubokých neuronových sítí / Deep Neural Networks Used for Customer Support Cases Analysis
Marušic, Marek January 2018
Artificial intelligence is remarkably popular nowadays because it can cope with a variety of very complex tasks in fields such as image processing, audio processing, and natural language processing. Red Hat has already resolved a huge number of customer requests while supporting its various products. The idea was therefore proposed to apply artificial intelligence to this data in order to improve and speed up the process of resolving customer requests. This thesis describes the techniques used to process the data and the tasks that can be solved with deep neural networks. It also describes the various models that were created in the course of the work to address different tasks. Their performance is compared on those tasks.
|
156 |
A Confirmatory Analysis for Automating the Evaluation of Motivation Letters to Emulate Human Judgment
Mercado Salazar, Jorge Anibal, Rana, S M Masud January 2021
Manually reading, evaluating, and scoring motivation letters as part of the admissions process is a time-consuming and tedious task for Dalarna University's program managers. An automated scoring system would provide them with relief as well as the ability to make much faster decisions when selecting applicants for admission. The aim of this thesis was to analyse current human judgment and attempt to emulate it using machine learning techniques. We used various topic modelling methods, such as Latent Dirichlet Allocation and Non-Negative Matrix Factorization, to find the most interpretable topics, build a bridge between topics and human-defined factors, and finally evaluate model performance by predicting scoring values and measuring accuracy using logistic regression, discriminant analysis, and other classification algorithms. Although we were able to discover the meaning of almost all human factors on our own, the topic models' accuracy in predicting the overall score was unexpectedly low. Setting a threshold on the overall score to select applicants for admission yielded good overall accuracy, but did not yield consistently good precision or recall. During our investigation, we attempted to determine the possible causes of these unexpected results and found that not only are the limitations of topic modelling to blame, but human bias also plays a role.
|
157 |
Pharmacodynamics miner : an automated extraction of pharmacodynamic drug interactions
Lokhande, Hrishikesh 11 December 2013
Indiana University-Purdue University Indianapolis (IUPUI) / Pharmacodynamics (PD) studies the relationship between drug concentration and drug effect on target sites. This field has recently gained attention because studies involving PD drug-drug interactions (DDI) promise the discovery of multi-targeted drug agents and novel efficacious drug combinations. A PD drug combination can be synergistic, additive or antagonistic, depending on the summed effect of the drug combination at a target site. The PD literature has grown immensely and most of its knowledge is dispersed across different scientific journals, so the manual identification of PD DDI is a challenge. To support automated extraction of PD DDI, we propose Pharmacodynamics Miner (PD-Miner). PD-Miner is a text-mining tool capable of identifying PD DDI from in vitro PD experiments. It is powered by two major features: a collection of full-text articles and an in vitro PD ontology. The in vitro PD ontology currently has four classes and more than a hundred subclasses; the full-text corpus is annotated based on these classes and subclasses. The annotated full-text corpus forms a database of articles, which can be queried by drug keywords and ontology subclasses. Since the ontology covers term and concept meanings, the system is capable of formulating semantic queries. PD-Miner extracts in vitro PD DDI based on references to cell lines and cell phenotypes. The results take the form of sentence fragments in which important concepts are visually highlighted. To determine the accuracy of the system, we used a gold standard of five expert-curated articles. PD-Miner identified DDI with a recall of 75% and a precision of 46.55%. Along with the development of PD-Miner, we also report the development of a semantically annotated in vitro PD corpus. This corpus includes term-level and sentence-level annotations and serves as a gold standard for future text mining.
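The abstract reports precision and recall separately. As a hedged illustration of how the two are commonly combined, the sketch below computes the balanced F-measure (the harmonic mean of precision and recall). The abstract itself reports no F-measure, so the value printed here is derived arithmetic and not a result from the thesis; the function name is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract: precision 46.55%, recall 75%.
print(round(f_measure(0.4655, 0.75), 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, the combined score sits closer to the precision than to the recall.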
|
158 |
Dialogue homme-machine multimodal : de la pragmatique linguistique à la conception de systèmes / Multimodal human-machine dialogue: from linguistic pragmatics to system design
Landragin, Frédéric 28 June 2013
One of the fundamental goals of human-machine dialogue is to come closer to natural dialogue in natural language, that is, to allow interaction between the machine and its human user in the user's own language (natural language), with an exchange structure similar to that of a human dialogue (natural dialogue). The research involved draws on linguistic work that analyses language and on pragmatic work that analyses the use of language in context. Two important facets of linguistic pragmatics thus concern reference phenomena, for example the designation of objects accessible in the situational context, and speech acts, or dialogue acts, that is, the communicative actions performed by the utterances that make up speaking turns. We present our work on modelling and formalizing these two facets, with its application to dialogue with visual support and to dialogue combining speech and co-verbal gestures (multimodal dialogue). Another goal of human-machine dialogue is to put in place methodologies and means, for example reusable software architectures, to facilitate system development. We present our reflections and achievements in this direction, notably through our participation in a series of European projects. Finally, we propose research perspectives aimed at better integrating linguistic and pragmatic phenomena such as salience and ambiguity into human-machine dialogue.
|
159 |
Semantics and Knowledge Engineering for Requirements and Synthesis in Conceptual Design: Towards the Automation of Requirements Clarification and the Synthesis of Conceptual Design Solutions
Christophe, François 27 July 2012
This thesis suggests the use of tools from the disciplines of Computational Linguistics and Knowledge Representation, with the idea that such tools would enable the partial automation of two processes of Conceptual Design: the analysis of requirements and the synthesis of solution concepts. The viewpoint on Conceptual Design developed in this research is based on the systematic methodologies developed in the literature. The evolution of these methodologies has provided a precise description of the tasks the design team must carry out to achieve a successful design. The argument of this thesis is therefore that it is possible to create computer models of some of these tasks in order to partially automate the refinement of the design problem and the exploration of the design space. In Requirements Engineering, the definition of requirements consists of identifying the needs of various stakeholders and formalizing them into design specifications. During this task, designers must deal with individuals from different fields of expertise who express their needs with different levels of clarity. This research tackles this issue for requirements expressed in natural language (in this case, English). The analysis of needs is carried out at different linguistic levels: lexical, syntactic and semantic. The lexical level deals with the meaning of the words of a language. Syntactic analysis provides the construction of the sentence, i.e. the grammar of the language. The semantic level aims at finding the specific meaning of words in the context of a sentence. This research makes extensive use of a semantic atlas based on the concept of a clique from graph theory. This concept enables the computation of distances between a word and its synonyms. Additionally, a methodology and a similarity metric were defined for clarifying requirements at the syntactic, lexical and semantic levels. This methodology integrates tools from research collaborators.
In the synthesis process, a Knowledge Representation was developed of the concepts computers need in order to create solution concepts. These concepts are: function, input/output flow, generic organs, behaviour, and components. The semantic atlas is also used at this stage to enable a mapping between functions and their solutions. It works as the interface between the concepts of this Knowledge Representation.
|
160 |
Algorithmes bio-informatiques pour l'analyse de données de séquençage à haut débit / Bioinformatics algorithms for the analysis of high-throughput sequencing data
Kopylova, Evguenia 11 December 2013
Nucleotide sequence alignment is a method used to identify regions of similarity between organisms at the genomic level. In this thesis we focus on the alignment of millions of short sequences produced by Next-Generation Sequencing (NGS) technologies against a reference database. In particular, we direct our attention toward the analysis of metagenomic and metatranscriptomic data, that is, the DNA and RNA directly extracted from an environment. Our algorithms confront two major challenges. First, all NGS technologies today are susceptible to sequencing errors in the form of nucleotide substitutions, insertions and deletions, and error rates vary between 1% and 15%. Second, metagenomic samples can contain thousands of unknown organisms, and the only means of identifying them is to align against known, closely related species. To overcome these challenges we designed a new approximate matching technique based on the universal Levenshtein automaton, which quickly locates short regions of similarity (seeds) between two sequences while allowing one error of any type. Using seeds to detect possible high-scoring alignments is a widely used heuristic for rapid sequence alignment, although most existing software is optimized for high-similarity searches and applies exact seeds. Furthermore, we describe a new indexing data structure based on the burst trie, which optimizes the search for approximate seeds. We demonstrate the efficacy of our method in two implemented tools, SortMeRNA and SortMeDNA. The former can quickly filter ribosomal RNA fragments from metatranscriptomic data, and the latter performs full alignment for genomic and metagenomic data.
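The notion of a seed that tolerates one error of any type (substitution, insertion or deletion) can be made concrete with a small sketch. It does not implement the universal Levenshtein automaton or the burst trie index described in the abstract; it swaps in a direct edit-distance-at-most-one check and a brute-force scan over the reference. The function names and the example sequences are hypothetical.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one substitution, insertion or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        # At most one substitution: the remainders after position i must match.
        return a[i + 1:] == b[i + 1:]
    # One insertion in b: drop b[i] and compare what is left.
    return a[i:] == b[i + 1:]


def find_seed_hits(seed, reference):
    """Brute-force scan for windows of the reference within one edit of the seed."""
    hits = []
    for start in range(len(reference)):
        for length in (len(seed) - 1, len(seed), len(seed) + 1):
            window = reference[start:start + length]
            if len(window) == length and within_one_edit(seed, window):
                hits.append((start, window))
    return hits


# Hypothetical example: the seed matches the reference with one substitution.
print(find_seed_hits("ACGT", "TTAGGTTT"))
```

The brute-force scan costs time proportional to the reference length times the seed length for every seed, which is the kind of cost an automaton combined with an index is meant to avoid.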
|