1

Using Coh-Metrix to Compare Cohesion Measures between the United States Senators John McCain and Barack Obama

Hultgren, Annie January 2009 (has links)
This investigation explores and analyzes speeches by John McCain and Barack Obama, the candidates in the 2008 United States presidential election. Ten speeches by each speaker, delivered in 2007 while they were senators of Arizona and Illinois respectively, were selected in an unbiased way from their official websites. The analyses concern cohesion measures, not the political content of the speeches. This approach was chosen to determine whether there are similarities and/or contrasts in terms of cohesion between the speakers or within each speaker's own set of speeches. The web-based tool Coh-Metrix was used, and nine of its measures were selected and analyzed in detail: average words per sentence, average syllables per word, Flesch Reading Ease score, average concreteness for content words, average minimum concreteness for content words, mean number of higher-level constituents, type-token ratio, syntactic structure similarity, and average number of negations. The two speakers had very similar results overall, apart from a few notable standard deviations, for example in the average concreteness and average minimum concreteness for content words. Eight of the nine measures showed non-significant differences according to a t-test for non-matched observations and/or a chi-square goodness-of-fit test. One measure, the average number of negation expressions per 1000 words, was nonetheless highly significant according to both the t-test and the chi-square test: Obama used about twice as many negations as McCain. The study shows that the speakers' twenty speeches are similar in structure and cohesion, except that Obama uses more negation expressions than McCain. These results do not, however, necessarily say anything further about the speakers and/or speeches.
Keywords: cohesion, cohesion markers, cohesion measures/measurements, Coh-Metrix, speeches, texts
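Most of the nine measures named above are simple surface statistics. As a rough illustration only (not Coh-Metrix's actual implementation: the syllable counter is a crude vowel-group heuristic and the negation list is a small hypothetical sample), they can be approximated in Python:

```python
import re

# Hypothetical negation list for illustration; Coh-Metrix uses its own lexicon.
NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor", "without"}

def count_syllables(word):
    # Crude approximation: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_measures(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)  # syllables/word
    return {
        "words_per_sentence": wps,
        "syllables_per_word": spw,
        # Standard Flesch Reading Ease formula.
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "negations_per_1000_words":
            1000 * sum(w.lower() in NEGATIONS for w in words) / len(words),
    }

# Per-speech values of one measure could then be compared between speakers with
# a t-test for non-matched observations (Welch's variant), e.g.:
#   from scipy import stats
#   t, p = stats.ttest_ind(obama_values, mccain_values, equal_var=False)
```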
2

Assessing English Proficiency in the Age of Globalization : The Case of IELTS / Att bedöma engelsk språkkunskap i globaliseringens tid : i fallet IELTS

Mehraban, Bahman January 2022 (has links)
This study investigates the use and relevance of the most widely used test of English proficiency in the era of globalization, IELTS. It discusses the different versions of the test along with their corresponding components. General Training IELTS Writing Task 2 is the central focus of the thesis; because the result of the test directly impacts the future trajectories of test takers' careers, the issue is of real-life relevance and significance. The study concentrates on three of the scoring criteria in the General Training IELTS writing rubric, which are used to assess test takers' achievement in cohesion and coherence, lexical diversity, and grammatical range and accuracy. Since the rubric attributes increases in these three measures solely to higher levels of proficiency in English, ignoring the impact of factors such as test takers' age and level of education, keeping proficiency constant can be expected to result in uniform performance on those measures. For this reason, the participants in this study were groups of English native speakers (homogeneous groups with the highest possible level of proficiency), who contributed the written texts that were later analyzed as the primary data, the aim being to see whether they would exhibit invariably equal levels of performance in those three areas in their writing. In addition, the IELTS website openly states that the so-called Standard English varieties spoken by native speakers in Britain, Canada, the US, Australia, and New Zealand form the basis of its proficiency assessment; analyzing texts written by groups of native speakers from those countries could test the validity of such claims.

The data used in this study consisted of texts written by English native speakers retrieved from the LOCNESS Corpus, as well as some texts collected through two specially designed Google Forms. The texts then underwent automated textual analysis with an online program called Coh-Metrix, which collects textual features as numeric indices in the areas of cohesion and coherence, lexical diversity, and grammatical range and accuracy. Those indices were analyzed for statistically significant differences using statistical procedures such as ANOVA, t-tests, and correlation analyses. Findings indicated significant differences among different groups of English native speakers on the investigated measures. It is argued that IELTS not only fails to capture the contemporary realities of English use worldwide but also assesses test takers against expectations that are not met even by English native speakers. Given the diverse use of English in professional as well as everyday encounters in the age of globalization, a single way of assessing English proficiency most probably cannot meet the requirements of every context of English use, and recruiting organizations would be more justified in basing their proficiency assessment on locally defined norms and practices.
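As a sketch of the group comparison described above, assuming one numeric index per text and hypothetical placeholder values rather than the study's data, a one-way ANOVA across native-speaker groups looks like this:

```python
import numpy as np
from scipy import stats

# Hypothetical lexical-diversity indices for three native-speaker groups
# (placeholder values, not the study's data).
rng = np.random.default_rng(0)
uk, us, au = (rng.normal(loc, 0.05, 30) for loc in (0.72, 0.70, 0.74))

# One-way ANOVA: do the group means differ significantly?
f_stat, p_value = stats.f_oneway(uk, us, au)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Pairwise follow-up with a t-test for non-matched observations (Welch's variant):
t, p = stats.ttest_ind(uk, us, equal_var=False)
```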
3

[en] A PSYCHOLINGUISTIC INVESTIGATION OF WRITING IN L1 AND L2: A STUDY WITH ENGLISH TEACHERS / [pt] UMA INVESTIGAÇÃO PSICOLINGUÍSTICA DA ESCRITA EM L1 E L2: UM ESTUDO COM PROFESSORES DE INGLÊS

RACHEL DA COSTA MURICY 23 November 2023 (has links)
[pt] A presente dissertação aborda a escrita bilíngue – Português como L1 e Inglês como L2, a partir de uma perspectiva cognitiva, com vistas a buscar caracterizar, de forma integrada, o processo e o produto da escrita, e possíveis correlações entre desempenho em escrita e aspectos atencionais. Participam da pesquisa 15 professores de língua inglesa (10 mulheres e 5 homens), idade média de 43,5 anos (DP 13,25), nativos do Português brasileiro. No estudo, foram empregadas ferramentas computacionais que possibilitam o registro das ações de escrita no curso da produção textual de textos argumentativos (programa Inputlog), a análise automática de características linguísticas do texto final (Nilc-Metrix (L1) e Coh-Metrix (L2)) e a verificação de padrões de conectividade no texto final, por meio de atributos de grafos (SpeechGraphs). Adotou-se também o teste ANT - Attention Network Test com o intuito de ampliar a reflexão a respeito de fatores cognitivos e possíveis influências na produção textual. Na análise do processo de escrita, foram examinados tanto padrões de pausa como operações de escrita ativa e ações de revisão (inserções e apagamentos). Na análise do produto, consideraram-se parâmetros ligados a aspectos vocabulares, semânticos, sintáticos e índices de legibilidade, e informações sobre recorrência lexical e conectividade entre palavras. No que tange ao processo, os resultados do estudo revelaram diferenças entre as duas línguas, com valores mais altos associados à escrita em Inglês, para (i) pausas no interior de palavras - possivelmente sinalizando uma demanda de ordem ortográfica - e (ii) percentual de escrita ininterrupta, indicando uma escrita com menos interrupções, com menor número de alterações/revisões. O estudo de correlação revelou que os participantes apresentam o mesmo perfil de escrita na L1 e na L2. Na análise do produto por meio do Coh-Metrix (Inglês) e Nilc-Metrix (Português), verificou-se, por meio de índice de legibilidade, que os textos apresentam complexidade moderada nas duas línguas. A despeito de diferenças em como as métricas são definidas em cada Programa, os resultados sugerem que os textos em Português apresentam graus de complexidade que se correlacionam com aspectos sintáticos (como número de palavras antes do verbo principal e índice de Flesch) e semânticos (grau de concretude). Na L2, destaca-se que a diversidade lexical permanece sendo um dos indicadores mais confiáveis de proficiência e graus de complexidade, correlacionando-se com comportamentos de pausas (antes de palavras) e revisão (normal production). Em relação ao SpeechGraphs, foram observadas diferenças significativas entre os textos na L1 e na L2 para quase todos os atributos de grafos analisados, o que é interpretado como um reflexo da forma como o programa lida com características morfológicas das duas línguas. Não foram observadas correlações entre o comportamento dos falantes na L1 e na L2. Foram ainda conduzidos estudos de correlação entre os dados do Inputlog e os das ferramentas Coh-Metrix e Nilc-Metrix e entre estas e os dados do SpeechGraphs. Nos dois estudos, observou-se uma correspondência entre parâmetros indicativos de complexidade das ferramentas utilizadas, sugerindo um caminho relevante de exploração de análise integrada processo-produto para trabalhos futuros. Em relação ao estudo de correlação entre dados do Inputlog e do ANT, destacaram-se as correlações entre acurácia e tempo de reação nas condições experimentais e os percentuais de apagamentos.
Os presentes achados abrem caminho e trazem contribuições significativas para o campo da psicolinguística no âmbito da pesquisa entre L1 e L2. / [en] This dissertation addresses bilingual writing – Portuguese as L1 and English as L2 – from a cognitive perspective, aiming to characterize both the writing process and the final product in an integrated manner and explore correlations between writing performance and attentional aspects. The research involves 15 English language teachers (10 women and 5 men) with an average age of 43.5 years (SD 13.25), native speakers of Brazilian Portuguese. The study used computational tools to record writing actions during the production of argumentative texts (the Inputlog program), to automatically analyze linguistic features of the final text (Nilc-Metrix for Portuguese and Coh-Metrix for English), and to verify connectivity patterns in the final text through graph attributes (the SpeechGraphs program). The Attention Network Test (ANT) was also adopted, to broaden the reflection on cognitive factors and their possible influence on text production. In the analysis of the writing process, patterns of pauses, active writing operations, and revision actions (insertions and deletions) were examined. In the product analysis, parameters related to vocabulary, semantics, syntax, and readability indices, as well as information on lexical recurrence and word connectivity, were considered. Regarding the writing process, the results of the study revealed differences between the two languages, with higher values associated with writing in English, particularly in terms of (i) pauses within words, possibly signaling orthographic demands, and (ii) the percentage of uninterrupted writing, suggesting less interruption and fewer alterations/revisions. Correlation analysis indicated that participants exhibited the same writing profile in both L1 and L2. In the product analysis using Coh-Metrix (English) and Nilc-Metrix (Portuguese), it was found, through readability indices, that the texts exhibited moderate complexity in both languages. Despite differences in how metrics are defined in each program, the results suggest that the Portuguese texts show degrees of complexity that correlate with syntactic aspects (such as the number of words before the main verb and the Flesch index) and semantic aspects (degree of concreteness). For L2, lexical diversity remains one of the most reliable indicators of proficiency and degree of complexity, correlating with pause behavior (before words) and revision (normal production). Regarding SpeechGraphs, significant differences were observed between texts in L1 and L2 for almost all analyzed graph attributes, reflecting how the program deals with the morphological characteristics of the two languages. No correlations were observed between the behavior of speakers in L1 and L2. Additionally, correlation studies were conducted between Inputlog data and the Coh-Metrix and Nilc-Metrix tools, as well as between these tools and the SpeechGraphs data. In both studies, a correspondence was observed between parameters indicative of complexity in the tools used, suggesting a relevant path for exploring integrated process-product analysis in future research. Regarding the correlation study between Inputlog and ANT data, notable correlations emerged between accuracy and reaction time in the experimental conditions and the percentages of deletions. These findings open the way for significant contributions to the field of psycholinguistics in the context of L1–L2 research.
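The graph attributes mentioned above come from representing a text as a word graph. Below is a minimal sketch of the idea, assuming the usual construction in which each distinct word is a node and each pair of consecutive words is a directed edge; the actual SpeechGraphs program computes many more attributes than shown here.

```python
import networkx as nx

def word_graph(text):
    # Each distinct word becomes a node; consecutive words are joined by an edge.
    tokens = text.lower().split()
    g = nx.DiGraph()
    g.add_edges_from(zip(tokens, tokens[1:]))
    return g

g = word_graph("the cat sat on the mat and the dog sat too")
attrs = {
    "nodes": g.number_of_nodes(),   # distinct words
    "edges": g.number_of_edges(),   # distinct word-to-word transitions
    "density": nx.density(g),       # edges relative to the possible maximum
}
# Repeated word sequences show up as recurring edges and loops in the graph,
# which is why morphological differences between languages affect the attributes.
print(attrs)
```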
4

An Analysis of Readability of Standard Measurements in English Textbooks by Swedish Publishers from the 90’s to 2016

Leander, Mia January 2016 (has links)
Teachers employ traditional readability measurements to estimate text difficulty when assigning textbooks that match students' current proficiency level. The purpose of this study was to see whether modern textbooks published by five major publishing companies in Sweden are more difficult to read, according to a traditional readability formula, than textbooks published in the 1990s. The study also investigates whether readability is similar across the five publishers. The Flesch Reading Ease formula in Coh-Metrix was used to calculate text difficulty in textbooks intended for Swedish grade 7. The primary material consisted of 70 texts selected from textbooks from three different time periods and five different publishers. The results indicated that the modern textbooks, published between 2012 and 2016, were more difficult to read than the older ones. In addition, the results indicated that readability was similar among the different publishers. However, the study showed that modern textbooks published by Liber were easier to read than this particular publisher's older ones. That modern textbooks have longer sentences and more complex syntax suggests increased expectations of 7th graders' reading abilities.
5

Comportamento de Métricas de Inteligibilidade Textual em Documentos Recuperados na Web / THE BEHAVIOR OF READABILITY METRICS IN DOCUMENTS RETRIEVED IN INTERNET AND ITS USE AS AN INFORMATION RETRIEVAL QUERY PARAMETER

Londero, Eduardo Bauer 29 March 2011 (has links)
Texts retrieved from the Internet through Google and Yahoo queries are evaluated using the Flesch-Kincaid Grade Level, a simple measure of text readability. Metrics of this kind were created to help writers evaluate their texts, and they have recently been used in automatic text simplification for less-skilled readers. In this work we apply the metric to documents freely retrieved from the Internet, seeking correlations between readability and the relevance assigned to them by search engines. The initial premise motivating the comparison between readability and relevance is the statement known as Occam's Principle, or principle of economy. The study applies the Flesch-Kincaid Grade Level to text documents retrieved from the Internet through search-engine queries and correlates it with ranking position. A centralist tendency was found in the retrieved texts, meaning that the average deviation of groups of files from the mean of the category they belong to is meaningful; with this measure it is possible to establish a correlation between relevance and readability, and also to detect differences in how the two search engines compute relevance. A subsequent experiment seeks to determine whether the readability measure, combined with the search engine's original ranking, can help the user choose a document, and whether it is useful as advance information for user choice and navigation. In a final experiment, based on the knowledge previously obtained, the Wikipedia and Britannica encyclopedias are compared using the Flesch-Kincaid readability metric. / Textos recuperados da Internet por intermédio de consultas ao Google e Yahoo são analisados segundo uma métrica simples de avaliação de inteligibilidade textual. Tais métricas foram criadas para orientar a produção textual e recentemente também foram empregadas em simplificadores textuais automáticos experimentais para leitores inexperientes. Nesse trabalho aplicam-se essas métricas a textos originais livres, recuperados da Internet, para buscar correlacionar o grau de inteligibilidade textual com a relevância que lhes é conferida pelos buscadores utilizados. A premissa inicial a estimular a comparação entre inteligibilidade e relevância é o enunciado conhecido como Princípio de Occam, ou princípio da economia. Observa-se uma tendência centralista que ocorre a partir do pequeno afastamento médio dos grupos de arquivos melhor colocados no ranking em relação à média da categoria a que pertencem. É com a medida do afastamento médio que se consegue verificar correlação com a posição do arquivo no ranking, e é também com essa medida que se consegue registrar diferenças entre o método de calcular a relevância do Google e do Yahoo. Um experimento que decorre do primeiro estudo procura determinar se a medida de inteligibilidade pode ser empregada para auxiliar o usuário da Internet a escolher arquivos mais simples, ou se a sua indicação junto à listagem de links recuperados é útil e informativa para a escolha e navegação do usuário. Em um experimento final, embasado no conhecimento previamente obtido, são comparadas as enciclopédias Britânica e Wikipédia por meio do emprego da métrica de inteligibilidade Flesch-Kincaid Grade Level.
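For reference, the Flesch-Kincaid Grade Level used in this work is a fixed linear formula over two surface ratios; a minimal implementation:

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    # Standard Flesch-Kincaid Grade Level formula: the result approximates
    # the US school grade needed to understand the text.
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Example: 120 words in 8 sentences with 180 syllables
print(flesch_kincaid_grade(120, 8, 180))  # ~7.96, roughly 8th grade
```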
6

Processamento de língua natural e níveis de proficiência do português : um estudo de produções textuais do exame Celpe-Bras

Evers, Aline January 2013 (has links)
Este trabalho trata dos temas da proficiência em português como língua adicional e da detecção de padrões lexicais e coesivos a partir de um enfoque computacional, situando o tema em meio à descrição de textos produzidos no contexto do exame de proficiência Celpe-Bras de 2006-1. Fazendo uso de pressupostos teórico-metodológicos da Linguística de Corpus, da Linguística Textual e do Processamento de Língua Natural, investigou-se a hipótese de que seria possível classificar, de modo automático, textos submetidos ao exame conforme níveis de proficiência pré-estabelecidos. Por meio do processamento de 177 textos previamente avaliados por corretores humanos em seis níveis (Iniciante, Básico, Intermediário, Intermediário Superior, Avançado e Avançado Superior), usou-se o Aprendizado de Máquina (AM) supervisionado para cotejar padrões lexicais e coesivos capazes de distinguir os níveis sob estudo. Para o cotejo dos padrões, a ferramenta Coh-Metrix-Port – que calcula parâmetros de coesão, coerência e inteligibilidade textual – foi utilizada. Cada um dos textos foi processado na ferramenta; para o AM, os resultados da ferramenta Coh-Metrix-Port foram usados como atributos, os níveis de proficiência como classes e os textos como instâncias. As etapas de processamento do corpus foram: 1) digitação do corpus; 2) processamento individual dos textos na ferramenta Coh-Metrix-Port; 3) análise usando AM – Algoritmo J48 – e os seis níveis de proficiência; 4) nova análise usando AM e duas novas classes: textos sem certificação (Iniciante e Básico) e com certificação (Intermediário, Intermediário Superior, Avançado e Avançado Superior). Apesar do tamanho reduzido do corpus, foi possível identificar os seguintes atributos distintivos entre os textos da amostra: número de palavras, medida de riqueza lexical, número de parágrafos, incidência de conectivos negativos, incidência de adjetivos e Índice Flesch. Chegou-se a um classificador capaz de separar dois conjuntos de texto (SEM e COM CERTIFICAÇÃO) através das métricas utilizadas (F-measure de 70%). / This research analyzes Portuguese proficiency from a computational perspective, studying texts submitted to the Brazilian Portuguese proficiency exam Celpe-Bras (Certificate of Proficiency in Portuguese for Foreigners). The study was based on Corpus Linguistics, Textual Linguistics, and Natural Language Processing. We investigated the hypothesis that it would be possible to predict second language proficiency using Machine Learning (ML), measures given by an NLP tool (Coh-Metrix-Port), and a corpus of texts previously classified by human raters. The texts (177) were previously classified as Beginner, Elementary, Intermediate, Upper Intermediate, Advanced, and Upper Advanced. After preparation, they were processed by Coh-Metrix-Port, a tool that calculates cohesion, coherence, and textual readability at different linguistic levels. The output of this tool provided 48 measures that were used as attributes; the proficiency levels given by raters were treated as classes, and the 177 texts as instances for ML purposes. The J48 algorithm was applied to this set of texts, producing a decision tree that classified the six levels of proficiency.
The results of this analysis were not conclusive; therefore, we performed a new analysis with two classes: one with texts that did not receive certification (Beginner and Elementary) and one with texts that did (Intermediate, Upper Intermediate, Advanced, and Upper Advanced). Despite the small size of the corpus, we were able to identify the following distinguishing attributes: number of words, type-token ratio, number of paragraphs, incidence of negative connectives, incidence of adjectives, and Flesch Index. The classifier was able to separate these two sets of texts with an F-measure of 70%.
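Below is a sketch of the classification setup described above, with scikit-learn's CART decision tree standing in for Weka's J48 (a C4.5 implementation) and random placeholder data in place of the 177 Coh-Metrix-Port feature vectors:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# Placeholder data: 177 texts x 48 Coh-Metrix-Port indices, with a binary
# label (0 = no certification, 1 = certification). Not the study's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(177, 48))
y = rng.integers(0, 2, size=177)

# CART decision tree as a stand-in for Weka's J48 (C4.5).
clf = DecisionTreeClassifier(max_depth=5, random_state=0)
pred = cross_val_predict(clf, X, y, cv=10)

# F-measure: harmonic mean of precision and recall.
print(f"F-measure: {f1_score(y, pred):.2f}")
```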
