1 |
Using Coh-Metrix to Compare Cohesion Measures between the United States Senators John McCain and Barack Obama / Hultgren, Annie, January 2009
This investigation explores and analyzes speeches by John McCain and Barack Obama, the candidates in the 2008 United States presidential election. Ten speeches by each speaker were selected in an unbiased way from the year 2007, taken from their official websites while they were senators of Arizona and Illinois respectively. The analyses concern cohesion measures, not the political content of the speeches. This approach was chosen to see whether there are similarities and/or contrasts in cohesion between the speakers or within each speaker's own set of speeches. The Coh-Metrix website was used, and nine of its measures were selected and analyzed in detail: average words per sentence, average syllables per word, Flesch Reading Ease score, average concreteness for content words, average minimum concreteness for content words, mean number of higher-level constituents, type-token ratio, syntactic structure similarity, and average number of negations. Overall the two speakers had very similar results, apart from a few notable standard deviations, for example in the average concreteness and average minimum concreteness for content words. Eight of the nine measures were non-significant according to a t-test for non-matched observations and/or a chi-square goodness-of-fit test. One measure, the average number of negation expressions per 1000 words, was nonetheless highly significant according to both the t-test and the chi-square test: Obama used about twice as many negations as McCain. This study shows that the speakers' twenty speeches are similar in structure and cohesion except that Obama uses more negation expressions than McCain. 
These results do not, however, necessarily say anything else about the speakers and/or the speeches.

Keywords: cohesion, cohesion markers, cohesion measures/measurements, Coh-Metrix, speeches, texts
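The one significant measure above, negation expressions per 1000 words, can be approximated in a few lines. This is a hedged sketch, not Coh-Metrix's actual implementation: the negation word list and the tokenizer below are illustrative assumptions.

```python
import re

# Illustrative subset of English negation expressions; Coh-Metrix's
# internal lexicon is more extensive and is not reproduced here.
NEGATIONS = {"not", "no", "never", "neither", "nor", "nothing", "none"}

def negations_per_1000_words(text: str) -> float:
    """Incidence of negation expressions per 1000 words (naive tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in NEGATIONS or t.endswith("n't"))
    return 1000 * hits / len(tokens)
```

Applied to each of the twenty speeches, per-speech rates of this kind would be the inputs to the t-test for non-matched observations reported above.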
|
2 |
Assessing English Proficiency in the Age of Globalization: The Case of IELTS / Att bedöma engelsk språkkunskap i globaliseringens tid: i fallet IELTS / Mehraban, Bahman, January 2022
This study investigates the use and relevance of IELTS, the most widely used test of English proficiency, in the era of globalization. It discusses the different versions of the test and their components. General Training IELTS Writing Task 2 is the central focus of the thesis; because the test result directly shapes test takers' professional trajectories, the issue has real-life relevance and significance. The study concentrates on three of the scoring criteria in the General Training IELTS writing rubric, which assess cohesion and coherence, lexical diversity, and grammatical range and accuracy. Since the rubric attributes increases in these three measures solely to higher proficiency in English, ignoring factors such as test takers' age and level of education, holding proficiency constant should yield uniform performance on those measures. For this reason, groups of English native speakers (homogeneous groups with the highest possible level of proficiency) were the participants of this study, and the texts they wrote were analyzed as the primary data, to see whether they would in fact exhibit equal levels of performance in those three areas. In addition, the IELTS website openly states that the so-called Standard English varieties spoken by native speakers in Britain, Canada, the US, Australia, and New Zealand form the basis of its proficiency assessment; analyzing texts written by groups of native speakers from those countries can test the validity of that claim. The data consisted of texts written by English native speakers retrieved from the LOCNESS Corpus as well as texts collected through two specially designed Google Forms. 
The texts then underwent automated textual analysis with the online program Coh-Metrix, which returns numeric indices for cohesion and coherence, lexical diversity, and grammatical range and accuracy. Those indices were tested for statistically significant differences using ANOVA, t-tests, and correlation studies. The findings indicated significant differences among the groups of English native speakers on the investigated measures. It is argued that IELTS not only fails to capture the contemporary realities of English use worldwide but also assesses test takers against expectations that even English native speakers do not meet. Given the diverse use of English in professional and daily encounters in the age of globalization, a single way of assessing English proficiency is unlikely to meet the requirements of every context of English use, and recruiting organizations would be better justified in basing their proficiency assessment on more locally defined norms and practices.
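The ANOVA step described above can be sketched as a one-way F statistic over groups of per-text indices. This is a minimal stdlib illustration; the group values in the test are hypothetical, and a real analysis would also look up a p-value in the F distribution.

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Each group here would hold one Coh-Metrix index (say, lexical diversity) for all texts from one native-speaker country; a large F indicates the groups differ more than chance would suggest.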
|
3 |
[en] A PSYCHOLINGUISTIC INVESTIGATION OF WRITING IN L1 AND L2: A STUDY WITH ENGLISH TEACHERS / [pt] UMA INVESTIGAÇÃO PSICOLINGUÍSTICA DA ESCRITA EM L1 E L2: UM ESTUDO COM PROFESSORES DE INGLÊS / RACHEL DA COSTA MURICY, 23 November 2023
[en] This dissertation addresses bilingual writing – Portuguese as L1 and English as L2
– from a cognitive perspective, aiming to characterize both the writing process and
the final product in an integrated manner and explore correlations between writing
performance and attentional aspects. The research involves 15 English language
teachers (10 women and 5 men) with an average age of 43.5 years (SD 13.25),
native speakers of Brazilian Portuguese. The study used computational tools to
record writing actions during the production of argumentative texts (the Inputlog
program), to automatically analyze linguistic features of the final text (Nilc-Metrix
for Portuguese and Coh-Metrix for English), and to verify connectivity
patterns in the final text using graph attributes (the SpeechGraphs program). The
Attention Network Test (ANT) was also adopted. In the analysis of the writing
process, patterns of pauses, active writing operations, and revision actions
(insertions and deletions) were examined. In the product analysis, parameters
related to vocabulary, semantics, syntax, and readability indices, as well as information
on lexical recurrence and word connectivity, were considered. Regarding the
writing process, the results of the study revealed differences between the two
languages, with higher values associated with writing in English, particularly in
terms of (i) pauses within words, indicating orthographic demands, and (ii) the
percentage of uninterrupted writing, suggesting less interruption and fewer
alterations/revisions. Correlation analysis indicated that participants exhibited a
similar writing profile in both L1 and L2. In the product analysis using Coh-Metrix
(English) and Nilc-Metrix (Portuguese), it was found, through readability indices,
that the texts exhibited moderate complexity in both languages. Despite differences
in how metrics are defined in each program, the results suggest that texts in
Portuguese show a higher level of complexity when considering syntactic aspects
(such as the number of words before main verbs) and semantic aspects
(concreteness degree). For L2, lexical diversity remains one of the most reliable
proficiency indicators, correlating with pause behavior (before words) and revision
(normal production). Regarding SpeechGraphs, significant differences were
observed between texts in L1 and L2 for almost all analyzed graph attributes,
reflecting how the program deals with morphological characteristics of the two
languages. No correlations were observed between the behavior of speakers in L1
and L2. Additionally, correlation studies were conducted between Inputlog data and
Coh-Metrix and Nilc-Metrix tools, as well as between these tools and SpeechGraphs
data. In both studies, a correspondence was observed between parameters indicative
of complexity in the tools used, suggesting a relevant path for exploring integrated
process-product analysis in future research. Regarding the correlation study
between Inputlog and ANT data, notable correlations emerged between accuracy
and reaction time in experimental conditions and percentages of deletions. These
findings pave the way for significant contributions to the field of psycholinguistics
in the context of research between L1 and L2.
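The correlation studies linking Inputlog, Coh-Metrix/Nilc-Metrix, and SpeechGraphs measures rest on a correlation coefficient over paired per-participant values. A minimal Pearson sketch in plain Python (the paired lists in the test are hypothetical):

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two paired lists of measures."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Here `xs` might be each participant's pause rate from Inputlog and `ys` the corresponding lexical diversity index, with values near +1 or -1 signaling the process-product correspondences the study reports.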
|
5 |
An Analysis of Readability of Standard Measurements in English Textbooks by Swedish Publishers from the '90s to 2016 / Leander, Mia, January 2016
Teachers employ traditional readability measurements to estimate text difficulty when assigning textbooks that match students' current proficiency level. The purpose of this study was to determine whether modern textbooks published by five major publishing companies in Sweden are more difficult to read, according to a traditional readability formula, than textbooks published in the '90s. The study also investigates whether readability is similar across the five publishers. The Flesch Reading Ease formula in Coh-Metrix was used to calculate text difficulty in textbooks intended for Swedish grade 7. The primary material consisted of 70 texts selected from textbooks from three time periods and five publishers. The results indicated that the modern textbooks, published between 2012 and 2016, were more difficult to read than the older ones. In addition, readability was similar across the publishers. However, modern textbooks published by Liber were easier to read than that publisher's older ones. That modern textbooks have longer sentences and more complex syntax suggests increased expectations of 7th graders' reading abilities.
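The Flesch Reading Ease formula behind this comparison is 206.835 - 1.015*(words per sentence) - 84.6*(syllables per word), with lower scores meaning harder text. A hedged sketch follows; the syllable counter is a naive vowel-group heuristic, an assumption rather than the syllabification a tool like Coh-Metrix performs.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per vowel group (an assumption,
    # not real syllabification).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher = easier to read."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))
```

Scoring each of the 70 textbook excerpts this way and comparing period averages would reproduce the shape of the analysis, if not its exact numbers.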
|
6 |
The Interplay of Text Complexity and Cohesion: Exploring and Analyzing Differences Across Levels of Readability in Easy-to-Read Text / Brissman, Wilgot, January 2024
When assessing the readability of a text it is helpful to consider all its interacting elements. These include its syntactic complexity, but other aspects, such as cohesion, are no less important. The thesis explores how these are reflected in each other and in the readability of books in a dataset provided by the publisher Nypon och Vilja, which consists of easy-to-read books divided into six levels of readability. To provide additional nuance, the interrelated concepts of epistemic stance and narrativity are introduced to deepen the analysis of the statistical findings; they also prove useful in the further discussion of complexity and cohesion as they relate to reading skill and knowledge asymmetries. Principal component analysis (PCA) is employed to uncover statistical relationships on a broader scale, while more specific in-depth analyses are performed for certain metrics. While the findings have some support in the literature, re-affirming the importance of narrativity for contextualizing cohesion, the clear link between higher complexity and less narrative text was not expected. Furthermore, the PCA indicates a more nuanced picture of referential cohesion and the use of its constituent metrics, depending on both narrativity and complexity.
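For intuition about the PCA step, the sketch below computes, for a two-metric toy case, the share of variance captured by the first principal component, using the closed-form eigenvalues of a 2x2 covariance matrix. A real analysis over many readability metrics would use a linear algebra library; the data points here are hypothetical.

```python
from math import sqrt
from statistics import mean

def first_pc_variance_share(points):
    """Fraction of total variance on the first principal component (2-D case)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    a = sum((x - mx) ** 2 for x in xs) / n                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / n                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]] in closed form.
    largest = (a + c + sqrt((a - c) ** 2 + 4 * b * b)) / 2
    return largest / (a + c)
```

A share near 1.0 means the two metrics move together (one underlying dimension, as when complexity and cohesion covary); a share near 0.5 means they are essentially independent.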
|
7 |
Comportamento de Métricas de Inteligibilidade Textual em Documentos Recuperados na Web / THE BEHAVIOR OF READABILITY METRICS IN DOCUMENTS RETRIEVED ON THE INTERNET AND ITS USE AS AN INFORMATION RETRIEVAL QUERY PARAMETER / Londero, Eduardo Bauer, 29 March 2011
Texts retrieved from the Internet through Google and Yahoo queries are evaluated using the Flesch-Kincaid Grade Level, a simple measure of text readability. Metrics of this kind were created to help writers evaluate their texts, and have recently been used in automatic text simplification for less proficient readers. In this work the metric is applied to documents freely retrieved from the Internet, seeking correlations between readability and the relevance assigned to the documents by search engines. The initial premise motivating the comparison between readability and relevance is the statement known as Occam's razor, or the principle of economy. A centralist tendency was found in the retrieved texts: the average deviation of the groups of best-ranked files from the mean of the category they belong to is meaningful. With this measure it is possible to establish a correlation between relevance and readability, and also to detect differences in how the two search engines compute relevance. A subsequent experiment investigates whether the readability measure, combined with the search engine's original ranking, can help users choose simpler documents, and whether displaying it alongside the list of retrieved links is useful advance information for choice and navigation. In a final experiment, based on the knowledge previously obtained, the Wikipedia and Britannica encyclopedias are compared using the Flesch-Kincaid Grade Level metric.
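The Flesch-Kincaid Grade Level used throughout this study maps two ratios onto U.S. school grades: 0.39*(words per sentence) + 11.8*(syllables per word) - 15.59. A hedged sketch, reusing a naive vowel-group syllable heuristic rather than true syllabification:

```python
import re

def syllables(word: str) -> int:
    # Vowel-group heuristic only; an assumption, not true syllabification.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: higher = harder, roughly a U.S. grade."""
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syls = sum(syllables(w) for w in words)
    return 0.39 * (len(words) / sents) + 11.8 * (syls / len(words)) - 15.59
```

Scoring each retrieved document this way, then comparing group averages against rank position, is the kind of computation the centralist-tendency analysis above would rest on.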
|
8 |
Processamento de língua natural e níveis de proficiência do português: um estudo de produções textuais do exame Celpe-Bras / Evers, Aline, January 2013
This research analyzes Portuguese proficiency from a computational perspective, studying texts submitted to the Brazilian Portuguese proficiency exam Celpe-Bras (Certificate of Proficiency in Portuguese for Foreigners). The study draws on Corpus Linguistics, Text Linguistics, and Natural Language Processing. We investigated the hypothesis that it is possible to predict second language proficiency using Machine Learning (ML), measures given by an NLP tool (Coh-Metrix-Port), and a corpus of texts previously classified by human raters. The 177 texts had been rated as Beginner, Elementary, Intermediate, Upper Intermediate, Advanced, or Upper Advanced. After preparation, they were processed by Coh-Metrix-Port, a tool that calculates cohesion, coherence, and textual readability at different linguistic levels. The tool's output provided 48 measures that were used as attributes; the proficiency levels given by raters were treated as classes, and the 177 texts as instances for ML purposes. The J48 algorithm was applied to this set of texts, yielding a decision tree that classified the six levels of proficiency. The results of this analysis were inconclusive, so a second analysis was performed with two classes: texts that did not receive a certificate (Beginner and Elementary) and texts that did (Intermediate, Upper Intermediate, Advanced, and Upper Advanced). Despite the small size of the corpus, the following distinguishing attributes were identified: number of words, type-token ratio, number of paragraphs, incidence of negative connectives, incidence of adjectives, and Flesch Index. The resulting classifier was able to separate the two sets of texts with an F-measure of 70%.
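J48 is Weka's implementation of the C4.5 decision tree learner; its core move is repeatedly choosing the attribute threshold with the highest information gain. A minimal single-attribute sketch (the word counts and certification labels in the test are hypothetical; real J48 additionally uses gain ratio and pruning):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {lab: labels.count(lab) for lab in set(labels)}
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Threshold on one numeric attribute that maximizes information gain."""
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = (base
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```

With "number of words" as the attribute, a split like this is exactly the kind of rule the reported decision tree could use to separate non-certified from certified texts.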
|