11 |
"Seleção de atributos importantes para a extração de conhecimento de bases de dados" / "Selection of important features for knowledge extraction from data bases". Lee, Huei Diana, 16 December 2005
O desenvolvimento da tecnologia e a propagação de sistemas computacionais nos mais variados domínios do conhecimento têm contribuído para a geração e o armazenamento de uma quantidade constantemente crescente de dados, em uma velocidade maior da que somos capazes de processar. De um modo geral, a principal razão para o armazenamento dessa enorme quantidade de dados é a utilização deles em benefício da humanidade. Diversas áreas têm se dedicado à pesquisa e à proposta de métodos e processos para tratar esses dados. Um desses processos é a Descoberta de Conhecimento em Bases de Dados, a qual tem como objetivo extrair conhecimento a partir das informações contidas nesses dados. Para alcançar esse objetivo, usualmente são construídos modelos (hipóteses), os quais podem ser gerados com o apoio de diferentes áreas tal como a de Aprendizado de Máquina. A Seleção de Atributos desempenha uma tarefa essencial dentro desse processo, pois representa um problema de fundamental importância em aprendizado de máquina, sendo freqüentemente realizada como uma etapa de pré-processamento. Seu objetivo é selecionar os atributos mais importantes, pois atributos não relevantes e/ou redundantes podem reduzir a precisão e a compreensibilidade das hipóteses induzidas por algoritmos de aprendizado supervisionado. Vários algoritmos para a seleção de atributos relevantes têm sido propostos na literatura. Entretanto, trabalhos recentes têm mostrado que também deve-se levar em conta a redundância para selecionar os atributos importantes, pois os atributos redundantes também afetam a qualidade das hipóteses induzidas. Para selecionar alguns e descartar outros, é preciso determinar a importância dos atributos segundo algum critério. Entre os vários critérios de importância de atributos propostos, alguns estão baseados em medidas de distância, consistência ou informação, enquanto outros são fundamentados em medidas de dependência. Outra questão essencial são as avaliações experimentais, as quais representam um importante instrumento de estimativa de performance de algoritmos de seleção de atributos, visto que não existe análise matemática que permita predizer que algoritmo de seleção de atributos será melhor que outro. Essas comparações entre performance de algoritmos são geralmente realizadas por meio da análise do erro do modelo construído a partir dos subconjuntos de atributos selecionados por esses algoritmos. Contudo, somente a consideração desse parâmetro não é suficiente; outras questões devem ser consideradas, tal como a percentagem de redução da quantidade de atributos desses subconjuntos de atributos selecionados. Neste trabalho é proposto um algoritmo que separa as análises de relevância e de redundância de atributos e introduz a utilização da Dimensão Fractal para tratar atributos redundantes em aprendizado supervisionado. É também proposto um modelo de avaliação de performance de algoritmos de seleção de atributos baseado no erro da hipótese construída e na percentagem de redução da quantidade de atributos selecionados. Resultados experimentais utilizando vários conjuntos de dados e diversos algoritmos consolidados na literatura, que selecionam atributos importantes, mostram que nossa proposta é competitiva com esses algoritmos. Outra questão importante relacionada à extração de conhecimento a partir de bases de dados é o formato no qual os dados estão representados. Usualmente, é necessário que os exemplos estejam descritos no formato atributo-valor. 
Neste trabalho também propomos uma metodologia para dar suporte, por meio de um processo semi-automático, à construção de conjuntos de dados nesse formato, originados de informações de pacientes contidas em laudos médicos que estão descritos em linguagem natural. Esse processo foi aplicado com sucesso a um caso real. / Progress in computer systems and devices applied to a number of different fields has made it possible to collect and store an increasing amount of data. Moreover, this technological advance enables the storage of a huge amount of data which is difficult to process unless new approaches are used. Broadly speaking, the main reason to maintain all these data is to use them for the benefit of humanity. Many areas are engaged in the research and proposal of methods and processes to deal with this growing data. One such process is Knowledge Discovery from Databases, which aims at finding valuable and interesting knowledge which may be hidden inside the data. In order to extract knowledge from data, models (hypotheses) are usually developed, supported by many fields such as Machine Learning. Feature Selection plays an important role in this process since it represents a central problem in machine learning and is frequently applied as a data pre-processing step. Its objective is to choose a subset from the original features that describes a data set, according to some importance criterion, by removing irrelevant and/or redundant features, as they may decrease data quality and reduce comprehensibility of hypotheses induced by supervised learning algorithms. Most state-of-the-art feature selection algorithms mainly focus on finding relevant features. However, it has been shown that relevance alone is not sufficient to select important features. Different approaches have been proposed to select features, among them the filter approach. The idea of this approach is to remove features before the model's induction takes place, based on general characteristics of the data set. For the purpose of selecting features and discarding others, it is necessary to measure the features' goodness, and many importance measures have been proposed. Some of them are based on distance measures, consistency of data and information content, while others are founded on dependence measures. As there is no mathematical analysis capable of predicting whether a feature selection algorithm will produce better feature subsets than others, it is important to empirically evaluate the performance of these algorithms. Comparisons among algorithms' performances are usually carried out through the model's error analysis. Nevertheless, this parameter alone is not sufficient, and other issues, such as the percentage of reduction of the selected feature subset, should also be taken into account. In this work we propose a filter that decouples features' relevance and redundancy analysis, and introduces the use of Fractal Dimension to deal with redundant features. We also propose a performance evaluation model based on the constructed hypothesis' error and the percentage of reduction obtained from the selected feature subset. Experimental results obtained using well-known feature selection algorithms on several data sets show that our proposal is competitive with them. Another important issue related to knowledge extraction from data is the format in which the data is represented. Usually, it is necessary to describe examples in the so-called attribute-value format. 
This work also proposes a methodology to support, through a semi-automatic process, the construction of a database in the attribute-value format from patient information contained in medical findings which are described in natural language. This process was successfully applied to a real case.
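As a rough illustration of how a fractal-dimension-based redundancy analysis can work, the sketch below estimates the correlation fractal dimension (D2) of a data set by box counting and greedily drops attributes whose removal barely changes that dimension: an exact copy or a combination of other attributes adds nothing to the intrinsic dimension. This is not the algorithm proposed in the thesis; the function names, the number of scales, the tolerance and the full-range log-log fit are illustrative simplifications.

```python
import numpy as np

def correlation_fractal_dimension(X, n_scales=10):
    """Estimate the correlation fractal dimension D2 by box counting.

    X is an (n_samples, n_features) array; each feature is min-max scaled to
    [0, 1] so one grid covers every dimension.  A real implementation would
    fit only the linear region of the log-log plot; the whole range is used
    here for brevity.
    """
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    X = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)

    log_r, log_s = [], []
    for k in range(1, n_scales + 1):
        r = 1.0 / 2 ** k                          # cell side at this scale
        cells = np.floor(X / r).astype(int)       # grid coordinates of each point
        _, counts = np.unique(cells, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s.append(np.log(np.sum(counts.astype(float) ** 2)))
    slope, _ = np.polyfit(log_r, log_s, 1)        # S(r) ~ r**D2, so the slope is D2
    return slope

def drop_redundant_features(X, tol=0.05):
    """Greedy backward elimination: a feature is treated as redundant when
    removing it leaves the data set's fractal dimension (almost) unchanged."""
    keep = list(range(X.shape[1]))
    d_full = correlation_fractal_dimension(X)
    changed = True
    while changed and len(keep) > 1:
        changed = False
        for j in list(keep):
            reduced = [c for c in keep if c != j]
            if abs(correlation_fractal_dimension(X[:, reduced]) - d_full) < tol:
                keep, changed = reduced, True
                break
    return keep

# Tiny demo: the third column is an exact copy of the first, so one of the
# duplicated pair is flagged as redundant and removed (prints [1, 2]).
X = np.random.default_rng(0).normal(size=(500, 3))
X[:, 2] = X[:, 0]
print(drop_redundant_features(X))
```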
|
12 |
Metodologia para mapeamento de informações não estruturadas descritas em laudos médicos para uma representação atributo-valor / A methodology for mapping non-structured medical findings to the attribute-value table format. Honorato, Daniel de Faveri, 29 April 2008
Devido à facilidade com que informações biomédicas em língua natural são registradas e armazenadas no formato digital, a recuperação de informações a partir de registros de pacientes nesse formato não estruturado apresenta diversos problemas a serem solucionados. Assim, a extração de informações estruturadas (por exemplo, no formato atributo-valor) a partir de registros não estruturados é um importante problema de pesquisa. Além disso, a representação de registros médicos não estruturados no formato atributo-valor permite a aplicação de uma grande variedade de métodos de extração de padrões. Para mapear registros médicos não estruturados no formato atributo-valor, propomos uma metodologia que pode ser utilizada para automaticamente (ou semi-automaticamente, com a ajuda de um especialista do domínio) mapear informações médicas de interesse armazenadas nos registros médicos e descritas em linguagem natural em um formato estruturado. Essa metodologia foi implementada em um sistema computacional chamado TP-DISCOVER, o qual gera uma tabela no formato atributo-valor a partir de um conjunto de registros de pacientes (documentos). De modo a identificar entidades importantes no conjunto de documentos, assim como relacionamentos significantes entre essas entidades, propomos uma abordagem de extração de terminologia híbrida (lingüística/estatística) a qual seleciona palavras e frases que aparecem com freqüência acima de um dado limiar por meio da aplicação de medidas estatísticas. A idéia geral dessa abordagem híbrida de extração de terminologia é que documentos especializados são caracterizados por repetir o uso de certas unidades léxicas ou construções morfo-sintáticas. Nosso objetivo é reduzir o esforço despendido na modelagem manual por meio da observação de regularidades no texto e o mapeamento dessas regularidades como nomes de atributos na representação atributo-valor. A metodologia proposta foi avaliada realizando a estruturação automática de uma coleção de 6000 documentos com informações de resultados de exames de Endoscopia Digestiva Alta descritos em língua natural. Os resultados experimentais, os quais podem ser considerados os piores resultados, uma vez que esses resultados poderiam ser muito melhores caso a metodologia fosse utilizada semi-automaticamente junto com um especialista do domínio, mostram que a metodologia proposta é adequada e permite reduzir o tempo usado pelo especialista para analisar grande quantidade de registros médicos. / Information retrieval from text stored in computer-based patient records is an important open-ended research problem, as the ease with which biomedical information is recorded and stored in digital form grows. Thus, means to extract structured information (for example, in the so-called attribute-value format) from free-text records is an important research endeavor. Furthermore, by representing the free-text records in the attribute-value format, available pattern extraction methods can be directly applied. To map free-text medical records into the attribute-value format, we propose a methodology that can be used to automatically (or semi-automatically, with the help of a medical expert) map the important medical information stored in patient records which are described in natural language into a structured format. This methodology has been implemented in a computational system called TP-DISCOVER, which generates a database in the attribute-value format from a set of patient records (documents). 
In order to identify important entities in the set of documents, as well as significant relations among these entities, we propose a hybrid linguistic/statistical terminology extraction approach which applies statistical measures to select words and phrases that appear with a frequency above a given threshold. The underlying assumption of this hybrid approach to terminology extraction is that specialized documents are characterized by repeated use of certain lexical units or morpho-syntactic constructions. Our goal is to reduce the effort spent in manual modelling by observing regularities in the texts and by mapping them into suitable attribute names in the attribute-value representation format. The proposed methodology was evaluated by automatically structuring a collection of 6000 documents containing upper digestive endoscopy exam results described in natural language. The experimental results, all of which can be considered lower-bound results as they would greatly improve if the methodology were applied semi-automatically together with a medical expert, show that the proposed methodology is suitable and reduces the medical expert's workload in analysing large amounts of medical records.
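The statistical half of such a hybrid terminology extraction step might look roughly like the sketch below: word n-grams are counted over the report collection, those above a frequency threshold become attribute names, and each report is mapped to a presence/absence row. The linguistic patterns and the value extraction used in TP-DISCOVER are not reproduced here; the tokenizer, the n-gram length and the threshold are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Lower-cased words only (Unicode letters), so accented Portuguese terms survive.
    return re.findall(r"[^\W\d_]+", text.lower())

def candidate_terms(documents, max_n=3, min_freq=2):
    """Count word n-grams (n = 1..max_n) over the collection and keep those whose
    frequency reaches the threshold; these become candidate attribute names."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return [term for term, c in counts.most_common() if c >= min_freq]

def to_attribute_value(documents, terms):
    """Map each report to a row recording the presence (1) or absence (0) of
    every selected term, i.e. a crude attribute-value table."""
    rows = []
    for doc in documents:
        text = " ".join(tokenize(doc))
        rows.append({term: int(term in text) for term in terms})
    return rows

# Toy reports standing in for real endoscopy findings.
reports = ["esofago com mucosa normal",
           "mucosa gastrica com gastrite leve",
           "esofago normal e mucosa gastrica normal"]
terms = candidate_terms(reports, max_n=2, min_freq=2)
print(terms)
print(to_attribute_value(reports, terms)[0])
```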
|
13 |
"Pré-processamento de dados em aprendizado de máquina supervisionado" / "Data pre-processing for supervised machine learning". Batista, Gustavo Enrique de Almeida Prado Alves, 16 May 2003
A qualidade de dados é uma das principais preocupações em Aprendizado de Máquina - AM - cujos algoritmos são freqüentemente utilizados para extrair conhecimento durante a fase de Mineração de Dados - MD - da nova área de pesquisa chamada Descoberta de Conhecimento de Bancos de Dados. Uma vez que a maioria dos algoritmos de aprendizado induz conhecimento estritamente a partir de dados, a qualidade do conhecimento extraído é amplamente determinada pela qualidade dos dados de entrada. Diversos aspectos podem influenciar no desempenho de um sistema de aprendizado devido à qualidade dos dados. Em bases de dados reais, dois desses aspectos estão relacionados com (i) a presença de valores desconhecidos, os quais são tratados de uma forma bastante simplista por diversos algoritmos de AM; e (ii) a diferença entre o número de exemplos, ou registros de um banco de dados, que pertencem a diferentes classes, uma vez que quando essa diferença é expressiva, sistemas de aprendizado podem ter dificuldades em aprender o conceito relacionado com a classe minoritária. O problema de tratamento de valores desconhecidos é de grande interesse prático e teórico. Em diversas aplicações é importante saber como proceder quando as informações disponíveis estão incompletas ou quando as fontes de informações se tornam indisponíveis. O tratamento de valores desconhecidos deve ser cuidadosamente planejado, caso contrário, distorções podem ser introduzidas no conhecimento induzido. Neste trabalho é proposta a utilização do algoritmo k-vizinhos mais próximos como método de imputação. Imputação é um termo que denota um procedimento que substitui os valores desconhecidos de um conjunto de dados por valores plausíveis. As análises conduzidas neste trabalho indicam que a imputação de valores desconhecidos com base no algoritmo k-vizinhos mais próximos pode superar o desempenho das estratégias internas utilizadas para tratar valores desconhecidos pelos sistemas C4.5 e CN2, bem como a imputação pela média ou moda, um método amplamente utilizado para tratar valores desconhecidos. O problema de aprender a partir de conjuntos de dados com classes desbalanceadas é de crucial importância, uma vez que esses conjuntos de dados podem ser encontrados em diversos domínios. Classes com distribuições desbalanceadas podem se constituir em um gargalo significante no desempenho obtido por sistemas de aprendizado que assumem uma distribuição balanceada das classes. Uma solução para o problema de aprendizado com distribuições desbalanceadas de classes é balancear artificialmente o conjunto de dados. Neste trabalho é avaliado o uso do método de seleção unilateral, o qual realiza uma remoção cuidadosa dos casos que pertencem à classe majoritária, mantendo os casos da classe minoritária. Essa remoção cuidadosa consiste em detectar e remover casos considerados menos confiáveis, por meio do uso de algumas heurísticas. Uma vez que não existe uma análise matemática capaz de predizer se o desempenho de um método é superior aos demais, análises experimentais possuem um papel importante na avaliação de sistemas de aprendizado. Neste trabalho é proposto e implementado o ambiente computacional Discover Learning Environment - DLE - o qual é um framework para desenvolver e avaliar novos métodos de pré-processamento de dados. 
O ambiente DLE é integrado ao projeto Discover, um projeto de pesquisa em desenvolvimento em nosso laboratório para planejamento e execução de experimentos relacionados com o uso de sistemas de aprendizado durante a fase de Mineração de dados do processo de KDD. / Data quality is a major concern in Machine Learning, which is frequently used to extract knowledge during the Data Mining phase of the relatively new research area called Knowledge Discovery from Databases - KDD. As most Machine Learning algorithms induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. Several aspects may influence the performance of a learning system due to data quality. In real world databases, two of these aspects are related to (i) the presence of missing data, which is handled in a rather naive way by many Machine Learning algorithms; (ii) the difference between the number of examples, or database records, that belong to different classes since, when this difference is large, learning systems may have difficulties in learning the concept related to the minority class. The problem of missing data is of great practical and theoretical interest. In many applications it is important to know how to react if the available information is incomplete or if sources of information become unavailable. Missing data treatment should be carefully planned; otherwise, bias might be introduced into the induced knowledge. In this work, we propose the use of the k-nearest neighbour algorithm as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set by some plausible values. Our analysis indicates that missing data imputation based on the k-nearest neighbour algorithm can outperform the internal missing data treatment strategies used by C4.5 and CN2, and the mean or mode imputation, a widely used method for treating missing values. The problem of learning from imbalanced data sets is of crucial importance since it is encountered in a large number of domains. Imbalanced class distributions might cause a significant bottleneck in the performance obtained by standard learning methods, which assume a balanced distribution of the classes. One solution to the problem of learning with skewed class distributions is to artificially balance the data set. In this work we propose the use of the one-sided selection method, which performs a careful removal of cases belonging to the majority class while leaving untouched all cases from the minority class. Such careful removal consists of detecting and removing cases considered less reliable, using some heuristics. An experimental application confirmed the efficiency of the proposed method. As there is not a mathematical analysis able to predict whether the performance of a learning system is better than that of others, experimentation plays an important role in evaluating learning systems. In this work we propose and implement a computational environment, the Discover Learning Environment - DLE - which is a framework to develop and evaluate new data pre-processing methods. The DLE is integrated into the Discover project, a major research project under development in our laboratory for planning and execution of experiments related to the use of learning systems during the Data Mining phase of the KDD process.
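Both pre-processing ideas studied in the thesis, k-nearest-neighbour imputation and one-sided selection, have modern off-the-shelf counterparts; the sketch below runs them on toy data with scikit-learn and imbalanced-learn (assuming both packages are installed). It is a present-day stand-in for illustration, not the implementation in the DLE environment.

```python
import numpy as np
from sklearn.impute import KNNImputer                    # k-NN based imputation
from imblearn.under_sampling import OneSidedSelection    # one-sided selection

# Toy data: 2 features with ~10% missing values and a 9:1 class imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[rng.random(X.shape) < 0.1] = np.nan
y = np.array([0] * 180 + [1] * 20)

# 1) Replace each missing value using the k nearest complete neighbours,
#    instead of a global mean/mode or the learner's internal treatment.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# 2) One-sided selection: keep every minority example and remove majority
#    examples judged less reliable (noisy/borderline) or redundant.
X_bal, y_bal = OneSidedSelection(random_state=0).fit_resample(X_imputed, y)
print(X_bal.shape, np.bincount(y_bal))
```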
|
14 |
Spatio-Temporal Pre-Processing Methods for Region-of-Interest Video Coding. Karlsson, Linda S., January 2007
In video transmission at low bit rates the challenge is to compress the video with a minimal reduction of the perceived quality. The compression can be adapted to knowledge of which regions in the video sequence are of most interest to the viewer. Region of interest (ROI) video coding uses this information to control the allocation of bits to the background and the ROI. The aim is to increase the quality in the ROI at the expense of the quality in the background. For this to occur, the typical content of an ROI for a particular application is first determined, the actual detection is performed based on this information, and the allocation of bits is then controlled based on the result of the detection.
In this licentiate thesis existing methods to control bit allocation in ROI video coding are investigated, in particular pre-processing methods that are applied independently of the codec or standard. This makes it possible to apply the method directly to the video sequence without modifications to the codec. Three filters are proposed in this thesis based on previous approaches: a spatial filter that only modifies the background within a single frame, a temporal filter that uses information from the previous frame, and a combination of the two, the spatio-temporal filter. The abilities of these filters to reduce the number of bits necessary to encode the background and to successfully re-allocate these to the ROI are investigated. In addition, the computational complexities of the algorithms are analysed.
The theoretical analysis is verified by quantitative tests. These include measuring the quality using both the PSNR of the ROI and the border of the background, as well as subjective tests with human test subjects and an analysis of motion vector statistics.
The quantitative analysis shows that the spatio-temporal filter has a better coding efficiency than the other filters and successfully re-allocates bits from the background to the ROI. The spatio-temporal filter gives an improvement in average PSNR in the ROI of more than 1.32 dB, or a reduction in bitrate of 31 %, compared to the encoding of the original sequence. This result is similar to or slightly better than that of the spatial filter; the spatio-temporal filter is nevertheless preferable, since its computational complexity is lower than that of the spatial filter.
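A minimal sketch of this family of codec-independent pre-filters, assuming a binary ROI mask is already available from the detection step: background pixels are low-pass filtered spatially and blended with the previous filtered frame, while ROI pixels pass through untouched, so an unmodified encoder naturally spends fewer bits on the background. The Gaussian kernel, the blending weight and the frame size are illustrative choices, not the exact filters evaluated in the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def roi_prefilter(frame, prev_filtered, roi_mask, sigma=2.0, alpha=0.5):
    """Spatio-temporal pre-filter applied to a luminance frame before encoding.

    frame         : current frame, float array in [0, 1]
    prev_filtered : previously filtered frame (None for the first frame)
    roi_mask      : boolean array, True inside the region of interest
    """
    # Spatial part: blur only the background so it costs fewer bits.
    out = np.where(roi_mask, frame, gaussian_filter(frame, sigma=sigma))

    # Temporal part: pull background pixels towards the previous filtered
    # frame, which shrinks the motion-compensated residual there.
    if prev_filtered is not None:
        out = np.where(roi_mask, frame, alpha * prev_filtered + (1 - alpha) * out)
    return out

# Typical use: filter the whole sequence, then feed the filtered frames to an
# unmodified encoder (the codec itself is left untouched).
frames = [np.random.rand(144, 176) for _ in range(3)]   # toy QCIF-sized frames
mask = np.zeros((144, 176), dtype=bool)
mask[40:100, 60:120] = True                              # assumed detected ROI
prev = None
for f in frames:
    prev = roi_prefilter(f, prev, mask)
```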
|
15 |
Essays on spatial point processes and bioinformatics. Fahlén, Jessica, January 2010
This thesis consists of two separate parts. The first part consists of one paper and considers problems concerning spatial point processes, and the second part includes three papers in the field of bioinformatics. The first part of the thesis is based on a forestry problem of estimating the number of trees in a region by using the information in an aerial photo showing the area covered by the trees. The positions of the trees are assumed to follow either a binomial point process or a hard-core Strauss process. Furthermore, discs of equal size are used to represent the tree-crowns. We provide formulas for the expectation and the variance of the relative vacancy for both processes. The formulas are approximate for the hard-core Strauss process. Simulations indicate that the approximations are accurate. The second part of this thesis focuses on pre-processing of microarray data. The microarray technology can be used to measure the expression of thousands of genes simultaneously in a single experiment. The technique is used to identify genes that are differentially expressed between two populations, e.g. diseased versus healthy individuals. This information can be used in several different ways, for example as diagnostic tools and in drug discovery. The microarray technique involves a number of complex experimental steps, where each step introduces variability in the data. Pre-processing aims to reduce this variation and is a crucial part of the data analysis. Paper II gives a review of several pre-processing methods. Spike-in data are used to describe how the different methods affect the sensitivity and bias of the experiment. An important step in pre-processing is dye-normalization. This normalization aims to remove the systematic differences due to the use of different dyes for coloring the samples. In Paper III a novel dye-normalization, the MC-normalization, is proposed. The idea behind this normalization is to let the channels’ individual intensities determine the correction, rather than the average intensity, which is the case for the commonly used MA-normalization. Spike-in data showed that the MC-normalization reduced the bias for the differentially expressed genes compared to the MA-normalization. The standard method for preserving patient samples for diagnostic purposes is fixation in formalin followed by embedding in paraffin (FFPE). In Paper IV we used tongue-cancer microRNA-microarray data to study the effect of FFPE-storage. We suggest that the microRNAs are not equally affected by the storage time and propose a novel procedure to remove this bias. The procedure improves the ability of the analysis to detect differentially expressed microRNAs.
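For orientation, the conventional intensity-dependent dye normalization that the MC-normalization is compared against can be sketched as follows: M = log2(R/G) is regressed on A = (log2 R + log2 G)/2 with a loess curve and the fitted trend, taken as dye bias, is subtracted. The MC-normalization itself (letting each channel's own intensity drive the correction) is not reproduced here; the simulated intensities and the loess span are illustrative assumptions.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ma_normalize(red, green, frac=0.4):
    """MA (loess) dye normalization for a two-colour array: the intensity-
    dependent trend of M on A is treated as dye bias and removed."""
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    trend = lowess(m, a, frac=frac, return_sorted=False)
    return m - trend

# Toy example: a dye bias that grows with spot intensity is simulated and removed.
rng = np.random.default_rng(1)
signal = rng.lognormal(mean=7, sigma=1, size=1000)
green = signal * rng.lognormal(sigma=0.1, size=1000)
red = signal * rng.lognormal(sigma=0.1, size=1000) * (1 + 0.02 * np.log2(signal))
m_raw = np.log2(red) - np.log2(green)
m_norm = ma_normalize(red, green)
print(m_raw.mean(), m_norm.mean())   # the systematic offset shrinks towards zero
```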
|
16 |
Hand Gesture Detection & Recognition System. Khan, Muhammad, January 2012
The project introduces an application using computer vision for hand gesture recognition. A camera records a live video stream, from which a snapshot is taken with the help of an interface. The system is trained for each type of counting hand gesture (one, two, three, four, and five) at least once. After that, a test gesture is given to it and the system tries to recognize it. Research was carried out on a number of algorithms that could best differentiate hand gestures. It was found that the diagonal sum algorithm gave the highest accuracy rate. In the preprocessing phase, a self-developed algorithm removes the background of each training gesture. After that the image is converted into a binary image and the sums of all diagonal elements of the picture are taken. These sums help in differentiating and classifying different hand gestures. Previous systems have used data gloves or markers for input to the system; the system presented here imposes no such constraints, and the user can give hand gestures in view of the camera naturally. A completely robust hand gesture recognition system is still under heavy research and development; the implemented system serves as an extendible foundation for future work.
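The diagonal-sum idea can be sketched in a few lines: after background removal and binarization, the sums over every diagonal of the image form a feature vector, and a test gesture gets the label of the closest stored training gesture. The fixed image size, the threshold and the nearest-template matching are simplifying assumptions, not the exact implementation of the project.

```python
import numpy as np

def diagonal_sum_features(gray, threshold=0.5):
    """Binarize a background-removed grey image and return the sums of all of
    its diagonals; images must share the same size so the vectors align."""
    binary = (gray > threshold).astype(int)
    h, w = binary.shape
    return np.array([binary.diagonal(offset=k).sum() for k in range(-(h - 1), w)])

def recognize(test_img, templates):
    """Assign the label of the training gesture whose diagonal-sum feature
    vector is closest (Euclidean distance) to that of the test image."""
    test_vec = diagonal_sum_features(test_img)
    dists = {label: np.linalg.norm(diagonal_sum_features(img) - test_vec)
             for label, img in templates.items()}
    return min(dists, key=dists.get)

# Toy usage with random stand-ins for the five training snapshots.
rng = np.random.default_rng(0)
templates = {name: rng.random((64, 64)) for name in ["one", "two", "three", "four", "five"]}
print(recognize(templates["three"], templates))   # -> "three"
```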
|
17 |
Effective Strategies for Improving Peptide Identification with Tandem Mass Spectrometry. Han, Xi, January 2011
Tandem mass spectrometry (MS/MS) has been routinely used to identify peptides from protein mixtures in the field of proteomics. However, only about 30% to 40% of current MS/MS spectra can be identified, while many of them remain unassigned, even though they are of reasonable quality.
The ubiquitous presence of post-translational modifications (PTMs) is one of the reasons for current low spectral identification rate. In order to identify post-translationally modified peptides, most existing software requires the specification of a few possible modifications. However, such knowledge of possible modifications is not always available. In this thesis, we describe a new algorithm for identifying modified peptides without requiring users to specify the possible modifications before the search routine; instead, all modifications from the Unimod database are considered. Meanwhile, several new techniques are employed to avoid the exponential growth of the search space, as well as to control the false discoveries due to this unrestricted search approach. A software tool, PeaksPTM, has been developed and it has already achieved a stronger performance than competitive tools for unrestricted identification of post-translationally modified peptides.
Another important reason for the failure of the search tools is the inaccurate mass or charge state measurement of the precursor peptide ion. In this thesis, we study the precursor mono-isotopic mass and charge determination problem, and propose an algorithm to correct precursor ion mass errors by assessing the isotopic features in the parent MS spectrum. The algorithm has been tested on two annotated data sets and achieved almost 100 percent accuracy. Furthermore, we have studied a more complicated problem, the MS/MS preprocessing problem, and propose a spectrum deconvolution algorithm. Experiments were conducted to compare its performance with that of other existing software.
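The idea of correcting the precursor mass from isotopic features in the parent MS spectrum can be pictured roughly as below: if a peak sits one isotope spacing (about 1.00335/z Da) below the reported precursor m/z, the instrument probably picked a higher isotope, so the corrected value steps down the envelope. Practical tools also check intensity ratios and re-evaluate the charge; the peak-list format, tolerance and step limit here are illustrative assumptions, not the thesis algorithm.

```python
NEUTRON_SPACING = 1.00335   # approximate spacing between isotopic peaks, in Da

def correct_monoisotopic_mz(ms1_peaks, precursor_mz, charge, tol=0.01, max_steps=3):
    """Walk left along the isotopic envelope in the parent (MS1) spectrum.

    ms1_peaks is a list of (mz, intensity) pairs; if a peak exists one isotope
    spacing (1.00335 / z) below the current value, step down to it.
    """
    mzs = [mz for mz, _ in ms1_peaks]

    def has_peak(target):
        return any(abs(mz - target) <= tol for mz in mzs)

    corrected = precursor_mz
    for _ in range(max_steps):
        candidate = corrected - NEUTRON_SPACING / charge
        if has_peak(candidate):
            corrected = candidate
        else:
            break
    return corrected

# Example: the instrument reported the third isotope peak of a 2+ ion.
ms1 = [(500.25, 1.0), (500.75, 0.8), (501.25, 0.4)]
print(correct_monoisotopic_mz(ms1, 501.25, charge=2))   # steps down to ~500.25
```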
|
18 |
Determination of physical contaminants in wheat using hyperspectral imaging. Lankapalli, Ravikanth, 22 April 2015
Cereal grains are an important part of the human diet; hence, there is a need to maintain high quality, and these grains must be free of physical and biological contaminants. A procedure was developed to differentiate physical contaminants from wheat using NIR (1000-1600 nm) hyperspectral imaging. Three experiments were conducted to select the best combination of spectral pre-processing technique and statistical classifier to classify physical contaminants: seven foreign material types (barley, canola, maize, flaxseed, oats, rye, and soybean); six dockage types (broken wheat kernels, buckwheat, chaff, wheat spikelets, stones, and wild oats); and two animal excreta types (deer and rabbit droppings) from Canada Western Red Spring (CWRS) wheat. The acquired spectra were processed using five spectral pre-processing techniques (first derivative, second derivative, Savitzky-Golay (SG) smoothing and differentiation, multiplicative scatter correction (MSC), and standard normal variate (SNV)). The raw and pre-processed data were classified using Support Vector Machines (SVM), Naïve Bayes (NB), and k-nearest neighbors (k-NN) classifiers. In each experiment, two-way and multi-way classifications were conducted.
Among all the contaminant types, stones, chaff, deer droppings and rabbit droppings were classified with 100% accuracy using the raw reflectance spectra and different statistical classifiers. The SNV technique with k-NN classifier gave the highest accuracy for the classification of foreign material types from wheat (98.3±0.2%) and dockage types from wheat (98.9±0.2%). The MSC and SNV techniques with SVM or k-NN classifier gave perfect classification (100.0±0.0%) for the classification of animal excreta types from wheat. Hence, the SNV technique with k-NN classifier was selected as the best model.
Two separate model performance evaluation experiments were conducted to identify and quantify (by number) the amount of each contaminant type present in wheat. The overall identification accuracy for the first degree of contamination (one contaminant type with wheat) and the highest degree of contamination (all contaminant types with wheat) was 97.6±1.6% and 92.5±6.5% for foreign material types; 98.0±1.8% and 94.3±6.2% for dockage types; and 100.0±0.0% and 100.0±0.0% for animal excreta types, respectively. Canola, stones, and deer and rabbit droppings were perfectly quantified (100.0±0.0%) at all levels of contamination. / February 2016
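To make the selected combination concrete, the sketch below applies the standard normal variate transform (each spectrum centred and scaled by its own mean and standard deviation, which suppresses additive and multiplicative scatter effects) and then trains a k-NN classifier, using simulated reflectance spectra in place of the real hyperspectral data. The simulated band shapes, noise levels and k = 3 are illustrative assumptions only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def snv(spectra):
    """Standard normal variate: centre and scale every spectrum individually."""
    spectra = np.asarray(spectra, dtype=float)
    return (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

# Simulated NIR (1000-1600 nm) reflectance spectra: the two classes differ in the
# position of an absorption band, while each spectrum carries its own random
# gain and offset (the scatter effects that SNV is meant to remove).
rng = np.random.default_rng(0)
wavelengths = np.linspace(1000, 1600, 61)

def simulate(band_centre, n):
    shape = np.exp(-((wavelengths - band_centre) / 60.0) ** 2)
    return np.array([rng.normal(1.0, 0.2) * shape + rng.normal(0.0, 0.3)
                     + 0.02 * rng.normal(size=61) for _ in range(n)])

X = np.vstack([simulate(1200, 60), simulate(1400, 60)])
y = np.array(["wheat"] * 60 + ["contaminant"] * 60)

clf = KNeighborsClassifier(n_neighbors=3).fit(snv(X), y)
print(clf.score(snv(X), y))   # training accuracy on the toy data
```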
|
19 |
Sintaktiese herrangskikking as voorprosessering in die ontwikkeling van Engels na Afrikaanse statistiese masjienvertaalsisteem / Syntactic reordering as pre-processing in the development of an English-to-Afrikaans statistical machine translation system. Griesel, Marissa, January 2011
Statistical machine translation to any of the resource-scarce South African languages generally results in low quality output. Large amounts of training data are required to generate output of such a standard that it can ease the work of human translators when incorporated into a translation environment. Sufficiently large corpora often do not exist and other techniques must be researched to improve the quality of the output. One of the methods in international literature that yielded good improvements in the quality of the output applies syntactic reordering as pre-processing. This pre-processing aims at simplifying the decoding process as fewer changes will need to be made during translation in this stage. Training will also benefit since the automatic word alignments can be drawn more easily because the word orders in both the source and target languages are more similar. The pre-processing is applied to the source language training data as well as to the text that is to be translated. It is in the form of rules that recognise patterns in the tags and adapt the structure accordingly. These tags are assigned to the source language side of the aligned parallel corpus with a syntactic analyser. In this research project, the technique is adapted for translation from English to Afrikaans and deals with the reordering of verbs, modals, the past tense construct, constructions with “to” and negation. The goal of these rules is to change the English (source language) structure to better resemble the Afrikaans (target language) structure. A thorough analysis of the output of the baseline system serves as the starting point. The errors that occur in the output are divided into categories and each of the underlying constructs for English and Afrikaans is examined. This analysis of the output and the literature on syntax for the two languages are combined to formulate the linguistically motivated rules. The module that performs the pre-processing is evaluated in terms of the precision and the recall, and these two measures are then combined in the F-score that gives one number by which the module can be assessed. All three of these measures compare well to international standards. Furthermore, a comparison is made between the system that is enriched by the pre-processing module and a baseline system on which no extra processing is applied. This comparison is done by automatically calculating two metrics (BLEU and NIST scores) and it shows very positive results. When evaluating the entire document, an increase in the BLEU score from 0,4968 to 0,5741 (7,7 %) and in the NIST score from 8,4515 to 9,4905 (10,4 %) is reported. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2011.
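The flavour of such tag-based reordering rules can be shown with a toy example: after a modal, Afrikaans places the main verb at the end of the clause ("I will eat the apple" / "Ek sal die appel eet"), so a rule that matches a modal followed by a bare verb moves that verb towards the clause end before alignment and decoding. The tag set, rule format and clause handling below are deliberately simplified illustrations, not the linguistically motivated rules developed in the project.

```python
def reorder_modal_clause(tagged):
    """Toy reordering rule over (word, Penn-Treebank-style tag) pairs: move the
    bare verb that follows a modal to the end of the clause, mirroring the
    Afrikaans word order that the target side will be aligned against."""
    out = list(tagged)
    for i, (_, tag) in enumerate(out):
        if tag == "MD" and i + 1 < len(out) and out[i + 1][1].startswith("VB"):
            verb = out.pop(i + 1)
            insert_at = len(out)
            while insert_at > 0 and out[insert_at - 1][1] in {".", ",", ":"}:
                insert_at -= 1                     # keep punctuation clause-final
            out.insert(insert_at, verb)
            break
    return out

sentence = [("I", "PRP"), ("will", "MD"), ("eat", "VB"),
            ("the", "DT"), ("apple", "NN"), (".", ".")]
print(" ".join(word for word, _ in reorder_modal_clause(sentence)))
# -> "I will the apple eat ."  (cf. Afrikaans "Ek sal die appel eet.")
```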
|