  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Condition-specific differential subnetwork analysis for biological systems

Jhamb, Deepali 04 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Biological systems behave differently under different conditions. Advances in sequencing technology over the last decade have led to the generation of enormous amounts of condition-specific data. However, these measurements often fail to identify low-abundance genes/proteins that can be biologically crucial. In this work, a novel text-mining system was first developed to extract condition-specific proteins from the biomedical literature. The literature-derived data were then combined with proteomics data to construct condition-specific protein interaction networks. Further, an innovative condition-specific differential analysis approach was designed to identify key differences, in the form of subnetworks, between any two given biological systems. The framework developed here was implemented to understand the differences between the limb regeneration-competent Ambystoma mexicanum and the regeneration-deficient Xenopus laevis. This study provides an exhaustive systems-level analysis comparing regeneration-competent and -deficient subnetworks to show how different molecular entities interconnect with each other and are rewired during the formation of an accumulation blastema in regenerating axolotl limbs. It also demonstrates the importance of literature-derived knowledge, specific to limb regeneration, in augmenting the systems biology analysis. Our findings show that although proteins might be common to two given biological conditions, they can be highly dissimilar in their biological and topological properties within the subnetwork. The knowledge gained from the distinguishing features of limb regeneration in amphibians can be used in the future to chemically induce regeneration in mammalian systems. The approach developed in this dissertation is scalable and adaptable to understanding differential subnetworks between any two biological systems.
This methodology will not only facilitate the understanding of the biological processes and molecular functions that govern a given system but also provide novel insights into the pathophysiology of diseases and conditions.
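The core idea of differential subnetwork analysis, that a protein shared by two conditions can still differ sharply in its topological properties, can be sketched as follows. This is a minimal illustration with invented protein names and edges, not data or code from the dissertation:

```python
from collections import defaultdict

def build_network(edges):
    """Adjacency sets from a condition-specific protein interaction edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def differential_nodes(net1, net2, min_degree_gap=2):
    """Shared proteins whose connectivity differs strongly between conditions."""
    shared = set(net1) & set(net2)
    return sorted(
        n for n in shared
        if abs(len(net1[n]) - len(net2[n])) >= min_degree_gap
    )

# Toy condition-specific networks; protein names are hypothetical.
regenerating = build_network([("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")])
non_regenerating = build_network([("P1", "P2"), ("P4", "P5")])

print(differential_nodes(regenerating, non_regenerating))  # ['P1']
```

Here "P1" appears in both networks but is a hub only in the regeneration-competent one, so it surfaces as a differential node even though its mere presence is common to both conditions.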
72

Probabilistic tree transducers for grammatical error correction

Buys, Jan Moolman 12 1900 (has links)
Thesis (MSc)--Stellenbosch University, 2013. / ENGLISH ABSTRACT: We investigate the application of weighted tree transducers to correcting grammatical errors in natural language. Weighted finite-state transducers (FSTs) have been used successfully in a wide range of natural language processing (NLP) tasks, even though the expressiveness of the linguistic transformations they perform is limited. Recently, there has been an increase in the use of weighted tree transducers and related formalisms that can express syntax-based natural language transformations in a probabilistic setting. The NLP task that we investigate is the automatic correction of grammar errors made by English language learners. In contrast to spelling correction, which can be performed with very high accuracy, the performance of grammar correction systems is still low for most error types. Commercial grammar correction systems mostly use rule-based methods. The most common approach in recent grammatical error correction research is to use statistical classifiers that make local decisions about the occurrence of specific error types. The approach that we investigate is related to a number of other approaches inspired by statistical machine translation (SMT) or based on language modelling. Corpora of language-learner writing annotated with error corrections are used as training data. Our baseline model is a noisy-channel FST model consisting of an n-gram language model and an FST error model, which performs word insertion, deletion and replacement operations. The tree transducer model we use to perform error correction is a weighted top-down tree-to-string transducer, formulated to perform transformations between parse trees of correct sentences and incorrect sentences. Using an algorithm developed for syntax-based SMT, transducer rules are extracted from training data in which the correct versions of the sentences have been parsed. Rule weights are also estimated from the training data.
Hypothesis sentences generated by the tree transducer are reranked using an n-gram language model. We perform experiments to evaluate the performance of different configurations of the proposed models. In our implementation an existing tree transducer toolkit is used. To make decoding time feasible, sentences are split into clauses and heuristic pruning is performed during decoding. We consider different modelling choices in the construction of transducer rules. The evaluation of our models is based on precision and recall. Experiments are performed to correct various error types on two learner corpora. The results show that our system is competitive with existing approaches on several error types.
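The noisy-channel intuition behind the baseline model, scoring a candidate correction by combining a language-model probability with an error-model probability, can be sketched as follows. The probabilities below are invented for illustration; they are not learned weights from the thesis:

```python
import math

def noisy_channel_score(lm_logprob, error_logprob, lm_weight=1.0):
    """log P(correct) + log P(observed | correct), with an optional LM weight."""
    return lm_weight * lm_logprob + error_logprob

def rerank(hypotheses):
    """hypotheses: list of (sentence, lm_logprob, error_logprob); best first."""
    return max(hypotheses, key=lambda h: noisy_channel_score(h[1], h[2]))[0]

# Toy candidates for the observed sentence "He go to school.":
# keeping it unchanged is cheap under the error model but implausible under
# the language model; one replacement operation wins overall.
candidates = [
    ("He go to school.",   math.log(1e-6), math.log(0.9)),
    ("He goes to school.", math.log(1e-4), math.log(0.1)),
]
print(rerank(candidates))  # He goes to school.
```

In the actual system the hypotheses come from the FST or tree transducer and the language model is an n-gram model; the combination rule is the same.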
73

Classificação de sites a partir das análises estrutural e textual / Website classification based on structural and textual analysis

Ribas, Oeslei Taborda 28 August 2013 (has links)
With the wide use of the web nowadays and its constant growth, the task of automatically classifying websites has gained increasing importance, since on many occasions it is necessary to block access to specific sites, for example adult-content sites in elementary and secondary schools. In the literature, different studies have appeared proposing new methods for classifying sites, with the goal of increasing the rate of correctly categorized pages. This work aims to contribute to current classification methods by comparing four aspects involved in the classification process: classification algorithms, dimensionality (number of attributes considered), attribute evaluation metrics, and the selection of textual and structural attributes present in web pages. We use the vector model to represent texts and a classical machine learning approach to the classification task. Several metrics are used to select the most relevant terms, and classification algorithms from different paradigms are compared: probabilistic (Naïve Bayes), decision trees (C4.5), instance-based learning (KNN - K-Nearest Neighbours) and Support Vector Machines (SVM). The experiments were performed on a dataset containing sites in two languages, Portuguese and English. The results show that it is possible to obtain a classifier with good accuracy using only the information from the anchor text of hyperlinks. In the experiments, the classifier based on this information achieved an F-measure of 99.59%.
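The combination singled out by the results, the vector model over anchor text with an instance-based classifier, can be sketched as follows. This is a minimal 1-nearest-neighbour illustration with an invented two-example corpus, not the thesis's actual pipeline or data:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for one anchor text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    if not u or not v:
        return 0.0
    dot = sum(u[t] * v[t] for t in u)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

def knn_classify(anchor_text, training):
    """training: list of (anchor_text, label); returns the nearest example's label."""
    query = tf_vector(anchor_text)
    return max(training, key=lambda ex: cosine(query, tf_vector(ex[0])))[1]

train = [
    ("buy cheap tickets online store", "commerce"),
    ("latest news politics world report", "news"),
]
print(knn_classify("online store discount tickets", train))  # commerce
```

A real run would use K > 1, TF-IDF weighting via one of the attribute evaluation metrics, and thousands of training pages; the vector-model machinery is the same.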
75

Odysseýs : sistema para análise de documentos de patentes / Odysseýs : system for analysis of patent documents

Masago, Fábio Kenji, 1984 04 August 2013 (has links)
Advisor: Jacques Wainer / Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação / Abstract: A patent is a document on an invention's property rights granted by the state to its authors, preventing others from producing, using, commercializing, importing or exporting the described invention without the permission of the document's owner. A frequent use of patents in economics is to measure the importance or technological impact of an entity's or nation's field of innovation. Patents can thus be regarded as a gauge of inventive activity, and the citations they contain as a means of measuring a country's or firm's knowledge flows and impacts, as well as of evaluating trends in a technological field. This thesis presents a computational tool to assist in the analysis of patents, addressing the applicability of the Latent Dirichlet Allocation (LDA) method to patent similarity. The computational system, called Odysseýs, evaluates the similarity between a patent given by the user and a group of documents, ordering them according to their degree of similarity to the patent under evaluation. In addition, the software allows the unsupervised generation of patent citation networks by searching the United States Patent and Trademark Office (USPTO) database for a set of related patents from a user-designated query, applying those patents to the similarity analysis and also to the generation of a knowledge-flow network. The lack of national software specifically for patent processing, and the few auxiliary tools available for analyzing such documents, were the main motivations for this project. / Master's in Computer Science
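The ranking step described above can be sketched as follows, assuming LDA has already inferred a topic distribution for each patent (the distributions and patent numbers below are invented, and Hellinger distance is one common choice for comparing topic distributions, not necessarily the one Odysseýs uses):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions (lower = more similar)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def rank_by_similarity(query, corpus):
    """corpus: dict patent_id -> topic distribution; most similar first."""
    return sorted(corpus, key=lambda pid: hellinger(query, corpus[pid]))

query_patent = [0.7, 0.2, 0.1]          # query patent, mostly "topic 0"
corpus = {
    "US101": [0.65, 0.25, 0.10],        # close to the query's topic mix
    "US202": [0.05, 0.15, 0.80],        # dominated by a different topic
}
print(rank_by_similarity(query_patent, corpus))  # ['US101', 'US202']
```

The ordered list is exactly what the abstract describes: candidate documents sorted by their degree of similarity to the patent under evaluation.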
76

Pharmacodynamics miner : an automated extraction of pharmacodynamic drug interactions

Lokhande, Hrishikesh 11 December 2013 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Pharmacodynamics (PD) studies the relationship between drug concentration and drug effect at target sites. This field has recently gained attention, as studies involving PD drug-drug interactions (DDIs) promise the discovery of multi-targeted drug agents and novel efficacious drug combinations. A PD drug combination can be synergistic, additive or antagonistic, depending upon the summed effect of the combination at a target site. The PD literature has grown immensely, and most of its knowledge is dispersed across different scientific journals, so the manual identification of PD DDIs is a challenge. To support an automated means of extracting PD DDIs, we propose Pharmacodynamics Miner (PD-Miner). PD-Miner is a text-mining tool capable of identifying PD DDIs from in vitro PD experiments. It is powered by two major features: a collection of full-text articles and an in vitro PD ontology. The ontology currently has four classes and more than one hundred subclasses; based on these classes and subclasses, the full-text corpus is annotated. The annotated corpus forms a database of articles that can be queried by drug keywords and ontology subclasses. Since the ontology covers term and concept meanings, the system is capable of formulating semantic queries. PD-Miner extracts in vitro PD DDIs based upon references to cell lines and cell phenotypes. The results are presented as sentence fragments in which important concepts are visually highlighted. To determine the accuracy of the system, we used a gold standard of five expert-curated articles. PD-Miner identified DDIs with a recall of 75% and a precision of 46.55%. Along with the development of PD-Miner, we also report the development of a semantically annotated in vitro PD corpus. This corpus includes term- and sentence-level annotations and serves as a gold standard for future text mining.
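The annotate-then-query idea behind PD-Miner can be sketched as follows. The keyword lexicon, subclass names and sentences below are invented illustrations, not the tool's actual ontology or corpus:

```python
# Toy lexicon mapping hypothetical ontology subclasses to surface keywords.
LEXICON = {
    "cell_line": {"mcf-7", "hela"},
    "effect": {"synergistic", "additive", "antagonistic"},
}

def annotate(sentence):
    """Return the ontology subclasses whose keywords occur in the sentence."""
    tokens = set(sentence.lower().replace(",", " ").split())
    return {cls for cls, terms in LEXICON.items() if terms & tokens}

def query(corpus, drug, subclass):
    """Sentences that mention the drug and carry an annotation of the subclass."""
    return [s for s in corpus
            if drug.lower() in s.lower() and subclass in annotate(s)]

corpus = [
    "Tamoxifen and gefitinib showed a synergistic effect in MCF-7 cells.",
    "Tamoxifen pharmacokinetics were measured in plasma.",
]
print(query(corpus, "tamoxifen", "effect"))
# ['Tamoxifen and gefitinib showed a synergistic effect in MCF-7 cells.']
```

The second sentence mentions the drug but carries no in vitro PD annotation, so it is filtered out; this is the sense in which the ontology makes the queries semantic rather than purely keyword-based.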
77

Development of isiXhosa text-to-speech modules to support e-Services in marginalized rural areas

Mhlana, Siphe January 2011 (has links)
Information and Communication Technology (ICT) projects are being initiated and deployed in marginalized areas to help improve the standard of living of community members. This has led to a new field responsible for information processing and knowledge development in rural areas, called Information and Communication Technology for Development (ICT4D). An ICT4D project has been implemented in a marginalized area called Dwesa, a rural area situated on the Wild Coast of the former homeland of Transkei, in the Eastern Cape Province of South Africa. In this rural community, e-Service projects have been developed and deployed to build on the existing ICT infrastructure, including an e-Commerce platform, an e-Judiciary service, e-Health and an e-Government portal. Although these projects are deployed in the area, community members face a language and literacy barrier, because the services are typically accessed through English textual interfaces. This is a challenge because their language of communication is isiXhosa, and some community members are illiterate: many can speak isiXhosa but cannot read or write it. This problem of illiteracy in rural areas affects both the youth and the elderly. This research seeks to design, develop and implement software modules that can convert isiXhosa text into natural-sounding isiXhosa speech. Such an application is called a Text-to-Speech (TTS) system. The main objective of this research is to improve the usability of ICT4D e-Services through the development of an isiXhosa Text-to-Speech system. The research is undertaken within the context of the Siyakhula Living Lab (SLL), an ICT4D intervention aimed at improving the lives of rural communities in South Africa and bridging the digital divide.
The developed TTS modules were subsequently tested to determine their applicability to improving e-Services usability. The results show acceptable levels of usability, with the system producing audio utterances of isiXhosa text for marginalized areas.
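A TTS front-end of the kind described above typically normalizes the input text and then converts graphemes to phonemes before synthesis. The sketch below illustrates only that pipeline shape; the rule table is a hypothetical placeholder and does not represent an actual isiXhosa phoneme inventory or the thesis's modules:

```python
# Hypothetical grapheme-to-phoneme rules, tried longest match first.
G2P_RULES = [("xh", "X"), ("th", "T"), ("a", "a"), ("o", "o"), ("i", "i"),
             ("s", "s"), ("m", "m"), ("l", "l"), ("n", "n")]

def normalize(text):
    """Lowercase and strip punctuation; a real front-end would also expand numbers."""
    return "".join(c for c in text.lower() if c.isalpha() or c.isspace())

def g2p(word):
    """Greedy longest-match conversion of one word to a phoneme string."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in G2P_RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1  # skip graphemes not covered by the toy rule set
    return " ".join(phones)

print(g2p(normalize("molo")))  # m o l o
```

The phoneme string would then drive a synthesis back end (concatenative or statistical) to produce the audio utterances the abstract refers to.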
78

An exploratory study using the predicate-argument structure to develop methodology for measuring semantic similarity of radiology sentences

Newsom, Eric Tyner 12 November 2013 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / The amount of information produced as electronic free text in healthcare is increasing to levels that humans cannot process to advance their professional practice. Information extraction (IE) is a sub-field of natural language processing whose goal is the data reduction of unstructured free text. Pertinent to IE is an annotated corpus that frames how IE methods should create the logical expressions necessary for processing the meaning of text. Most annotation approaches seek to maximize meaning and knowledge by chunking sentences into phrases and mapping these phrases to a knowledge source to create a logical expression. However, these studies consistently have problems addressing semantics, and none have addressed the issue of semantic similarity (or synonymy) to achieve data reduction. A successful methodology for data reduction depends on a framework that can represent the currently popular phrasal methods of IE but can also fully represent the sentence. This study explores and reports on the benefits, problems, and requirements of using the predicate-argument statement (PAS) as that framework. A convenience sample from a prior study, consisting of ten synsets of 100 unique sentences from radiology reports deemed by domain experts to mean the same thing, is the text from which the PAS structures are formed.
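The PAS framework can be sketched as follows: each sentence becomes a predicate plus a set of role-argument pairs, and semantic similarity becomes overlap between those structures. The hand-built PAS tuples and the Jaccard measure below are illustrative assumptions, not the study's method; a real pipeline would derive the structures from a semantic-role labeller:

```python
def pas_similarity(pas1, pas2):
    """Jaccard overlap over the predicate plus role-argument pairs."""
    items1 = {("PRED", pas1[0])} | set(pas1[1].items())
    items2 = {("PRED", pas2[0])} | set(pas2[1].items())
    return len(items1 & items2) / len(items1 | items2)

# Two radiology-style paraphrases with hypothetical PAS encodings:
# "No acute infiltrate is seen." vs "There is no evidence of acute infiltrate."
s1 = ("see", {"ARG1": "acute infiltrate", "NEG": "no"})
s2 = ("evidence", {"ARG1": "acute infiltrate", "NEG": "no"})

print(round(pas_similarity(s1, s2), 2))  # 0.5
```

The two sentences share their argument and negation but differ in predicate, so they score well above zero; this is the kind of structural overlap that phrase-chunking approaches, which compare surface strings, fail to capture.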
