551

Sumarização e extração de conceitos de notas explicativas em relatórios financeiros: ênfase nas notas das principais práticas contábeis / Summarization and concept extraction from explanatory notes in financial reports: emphasis on the notes on main accounting practices

Cagol, Adriano 27 April 2017
Financial statements present the financial performance of companies and are an important tool for analysing their financial position, and for the decision-making of investors, creditors, suppliers, customers, and others. They include explanatory notes (footnotes) that describe in detail the practices and policies behind the company's accounting methods, along with additional information. Depending on the objectives, a correct analysis of an entity's situation through the financial statements is not possible without interpreting and analysing the accompanying notes. Despite their importance, however, the automatic analysis of the explanatory notes remains an obstacle. In view of this gap, this work proposes a model that applies text mining techniques to extract concepts from, and summarize, the explanatory notes on the main accounting practices adopted by a company, in order to identify and structure the main methods used to measure accounting accounts and to generate summaries. A concept extraction algorithm and six summarization algorithms were applied to the explanatory notes of financial statements of companies registered with the Brazilian Securities and Exchange Commission (Comissão de Valores Mobiliários). The work shows that concept extraction yields promising results for identifying the measurement method of an accounting account, reaching 100% accuracy on the inventory and the property, plant and equipment notes and 96.97% accuracy on the revenue recognition note. It also evaluates the summarization algorithms with the ROUGE measure and points out the most promising ones, notably LexRank, which obtained the best evaluations overall. Funding: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior.
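The abstract above names LexRank among the six summarization algorithms and ROUGE as the evaluation measure, but publishes no code. The sketch below is a minimal, self-contained illustration of those two ideas rather than the thesis's implementation: a LexRank-style ranking of sentences by the stationary distribution of a TF-IDF cosine-similarity graph, plus a hand-rolled ROUGE-1 recall check. The example note sentences, the two-sentence summary length, and the reference summary are invented.

```python
# Minimal LexRank-style summarizer and ROUGE-1 recall check (illustrative only;
# the note sentences and the reference summary below are invented placeholders).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(sentences, damping=0.85, iters=100):
    """Score sentences by the stationary distribution of a TF-IDF similarity graph."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sim, 0.0)
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    trans = sim / row_sums                      # row-stochastic transition matrix
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                      # power iteration (PageRank-style)
        scores = (1 - damping) / n + damping * trans.T.dot(scores)
    return scores

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate summary."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(ref))
    return overlap / max(len(ref), 1)

note = [
    "Inventories are measured at the lower of cost and net realizable value.",
    "Cost is determined using the weighted average method.",
    "Provisions are recognized for obsolete or slow-moving items.",
    "Net realizable value is reviewed at each reporting date.",
]
scores = lexrank_scores(note)
summary = " ".join(note[i] for i in np.argsort(scores)[::-1][:2])
print(summary)
print("ROUGE-1 recall:", rouge1_recall(summary, note[0] + " " + note[1]))
```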
552

Suporte às micro e pequenas empresas a partir da gestão baseada em evidências: construção de ferramenta computacional baseada em inteligência artificial / Supporting micro and small enterprises through evidence-based management: building a computational tool based on artificial intelligence

Santos, Andrey Schmidt dos 26 February 2018
Micro and small enterprises (MSEs) account for 99% of the companies in Brazil, 70% of formal jobs and 27% of the gross domestic product. Despite this representativeness, the level of management education in MSEs is still low, which makes decision-making difficult. One alternative for improving decision-making is evidence-based management (EBM), an approach that helps practitioners find evidence and appraise it critically. One organization that helps MSEs find evidence and make decisions is the Brazilian Micro and Small Business Support Service (SEBRAE). SEBRAE runs a call center with limited capacity to support MSEs, and this capacity can be increased with artificial intelligence (AI) technologies. A literature review showed the absence of references on using AI to apply EBM in MSEs. In this context, the research asks what a computational tool to support the technical demands of MSEs would look like. To answer this question, a computational tool that supports the technical demands of MSEs through EBM was built, following a working method based on Design Science Research (DSR). Using DSR, an artifact with a question-answering module and a learning module was created. After four learning rounds, the artifact reached an accuracy of 90.70%. An experiment was also carried out to compare the artifact with the SEBRAE call center. In the quality dimension, the artifact achieved 53.59% of the call center's performance; in the time dimension, it outperformed the call center. The work contributes to the literature by developing an artifact that applies EBM. SEBRAE benefits from an alternative that can increase its service capacity, and the artifact can be used to complement and speed up service to MSEs. Funding: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior.
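The record describes a question-and-answer artifact with a learning module but gives no implementation detail. As a loose, hypothetical sketch of one plausible building block, the snippet below retrieves the closest previously answered question by TF-IDF cosine similarity; the question bank, the answers, and the threshold are invented and are not taken from the thesis.

```python
# Hypothetical "ask-answer" module: a new question is matched against previously
# answered questions by TF-IDF cosine similarity. The question/answer pairs and the
# threshold are invented; the thesis does not publish its data or code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

qa_bank = [
    ("How do I register a small business?", "Follow the registration checklist ..."),
    ("Which taxes apply to a micro enterprise?", "Micro enterprises may opt for ..."),
    ("How can I get a small business loan?", "Credit lines for small firms ..."),
]
vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform([q for q, _ in qa_bank])

def answer(new_question, threshold=0.2):
    sims = cosine_similarity(vectorizer.transform([new_question]), question_matrix)[0]
    best = int(sims.argmax())
    # Below the threshold the question would be routed to a human attendant, whose
    # reply could later be added to qa_bank (the "learning" step of the artifact).
    return qa_bank[best][1] if sims[best] >= threshold else None

print(answer("What taxes does a micro company have to pay?"))
```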
553

文字探勘在總體經濟上之應用- 以美國聯準會會議紀錄為例 / The application of text mining on macroeconomics : a case study of FOMC minutes

黃于珊, Huang, Yu Shan Unknown Date
In this study, 193 FOMC Minutes published between 1993 and March 2017 were used as research material. On the supervised side, latent semantic analysis (LSA) was used to extract the latent semantics of samples associated with interest rate increases, decreases, and no change, and linear discriminant analysis (LDA) was then used for classification. In addition, the study applies exploratory data analysis (EDA), an unsupervised approach, to look for relevant variables in the FOMC Minutes. The results show that LSA can broadly distinguish the characteristics of the increase, decrease, and unchanged samples, while EDA can identify important words in different periods or categories, reveal changes in the structure of the texts, and also cluster the documents.
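A hedged sketch of the supervised part of the pipeline described above: TF-IDF features reduced with truncated SVD (the usual computational form of LSA) and classified with linear discriminant analysis. The toy minutes excerpts and rate-decision labels are invented; the study itself used 193 real FOMC Minutes.

```python
# Sketch of the supervised pipeline: TF-IDF -> truncated SVD (LSA) -> linear
# discriminant analysis. The toy excerpts and labels below are invented.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

docs = [
    "inflation pressures warrant a firmer policy stance",
    "rising prices and strong growth support tightening",
    "labor markets weakened and downside risks increased",
    "slowing activity argued for additional policy accommodation",
    "the committee judged the current stance to remain appropriate",
    "members agreed to maintain the target range unchanged",
]
labels = ["increase", "increase", "decrease", "decrease", "unchanged", "unchanged"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),   # the latent semantic space
    LinearDiscriminantAnalysis(),
)
model.fit(docs, labels)
print(model.predict(["growth is slowing and risks are tilted to the downside"]))
```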
554

Um estudo sobre o papel de medidas de similaridade em visualização de coleções de documentos / A study on the role of similarity measures in visual text analytics

Frizzi Alejandra San Roman Salazar 27 September 2012
Information visualization techniques, such as similarity-based point placement, are used to create visual representations of data that highlight certain patterns. These techniques are sensitive to data quality, which in turn depends on a highly influential preprocessing step. This step involves cleaning the text and, in some cases, detecting terms and their weights, as well as defining a (dis)similarity function. Few studies have examined how these (dis)similarity calculations affect the quality of the visual representations generated for textual data. This work presents a study on the role of different (dis)similarity measures between pairs of texts in the generation of visual maps. We focus mainly on two types of distance functions, those computed from a vector representation of the text (Vector Space Model, VSM) and measures obtained by direct comparison of text strings, comparing their effect on visual maps produced with point placement techniques. Objective measures, such as the Neighborhood Hit (NH) and the Silhouette Coefficient, were used to compare the visual quality of the resulting maps. We found that both approaches have strengths, but in general the VSM gave better results with respect to class discrimination. However, the conventional VSM is not incremental: new additions to the collection force the recalculation of the data space and of previously computed dissimilarities. For this reason, a new incremental model based on the VSM (Incremental Vector Space Model, iVSM) was also considered in our comparative studies. The iVSM showed the best quantitative and qualitative results in several of the configurations tested. The evaluation results are presented, and recommendations on the application of different text similarity measures in visual analysis tasks are provided.
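The study contrasts (dis)similarity computed in the Vector Space Model with direct string comparison and scores the resulting maps with measures such as the Silhouette Coefficient. The snippet below is a small illustration of that contrast, not the thesis's experiment: cosine distances over TF-IDF vectors versus a character-level similarity ratio, each scored with the silhouette on invented toy documents and labels.

```python
# Toy comparison of the two (dis)similarity families: cosine distance over TF-IDF
# vectors (VSM) vs. direct string comparison, each scored with the Silhouette
# Coefficient on precomputed distances. Documents and class labels are invented.
import numpy as np
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "the team won the football match",
    "a great victory for the football club",
    "interest rates were raised by the central bank",
    "the bank increased its benchmark interest rate",
]
labels = [0, 0, 1, 1]

# (1) Vector Space Model: cosine distance between TF-IDF vectors.
vsm_dist = cosine_distances(TfidfVectorizer().fit_transform(docs))
np.fill_diagonal(vsm_dist, 0.0)

# (2) Direct string comparison: 1 minus the similarity ratio of the raw strings.
n = len(docs)
str_dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        str_dist[i, j] = 1.0 - SequenceMatcher(None, docs[i], docs[j]).ratio()

print("VSM silhouette:   ", silhouette_score(vsm_dist, labels, metric="precomputed"))
print("string silhouette:", silhouette_score(str_dist, labels, metric="precomputed"))
```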
555

Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak

Öhrström, Fredrik January 2018
Textual duplicates can be hard to detect because they differ in wording but have similar semantic meaning. At Etteplan, a technical documentation company, many writers accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database and amount to duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation (DBCV) index to chart the problem. A survey was sent out to determine a threshold for when documents stop being duplicates, and this value was then used to calculate a theoretical duplicate count.
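A rough sketch of the named pipeline, doc2vec embeddings clustered with HDBSCAN* and validated with a DBCV-style index, assuming the gensim and hdbscan packages; the toy instructions, the tiny hyperparameters, and the availability of hdbscan's validity_index helper are assumptions, and the real thesis worked on Etteplan's Simplified Technical English corpus.

```python
# Rough sketch: doc2vec embeddings, HDBSCAN* clustering, and a DBCV-style validity
# check. Assumes the gensim and hdbscan packages; the toy instructions and the tiny
# hyperparameters are illustrative only.
import numpy as np
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

instructions = [
    "remove the cover and disconnect the power cable",
    "disconnect the power cable after removing the cover",
    "lubricate the bearing every six months",
    "apply grease to the bearing twice a year",
    "check the oil level before starting the engine",
]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(instructions)]

model = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=50, seed=1)
vectors = np.array([model.dv[i] for i in range(len(instructions))], dtype=np.float64)

clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean")
labels = clusterer.fit_predict(vectors)     # -1 marks noise, other integers clusters
print("cluster labels:", labels)

# DBCV-style validation (assumes hdbscan.validity.validity_index is available and
# that at least two real clusters were found).
if len(set(labels) - {-1}) > 1:
    print("validity index:", hdbscan.validity.validity_index(vectors, labels))
```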
556

Contribution to automatic text classification : metrics and evolutionary algorithms / Contributions à la classification automatique de texte : métriques et algorithmes évolutifs

Mazyad, Ahmad 22 November 2018
This thesis deals with natural language processing and text mining, at the intersection of machine learning and statistics. We are particularly interested in term weighting schemes (TWS) in the context of supervised learning, and specifically in the text classification (TC) task. Within TC, the multi-label classification task has gained a lot of interest in recent years. Multi-label classification of textual data appears in many modern applications, such as news classification, where the task is to find the categories a newswire story belongs to (e.g., politics, middle east, oil) based on its textual content; music genre classification (e.g., jazz, pop, oldies, traditional pop) based on customer reviews; film classification (e.g., action, crime, drama); and product classification (e.g., electronics, computers, accessories). Traditional classification algorithms are generally binary classifiers and are not suited to multi-label classification, so the multi-label task is usually transformed into multiple single-label binary tasks. However, this transformation introduces several issues. First, term distributions are only considered with respect to the positive and the negative category (i.e., information on the correlations between terms and categories is lost). Second, it fails to consider any label dependency (i.e., information on existing correlations between classes is lost). Finally, since all categories but one are grouped into a single negative category, the newly created tasks are imbalanced. This information is commonly used by supervised TWS to improve the effectiveness of the classification system. Hence, after presenting the process of multi-label text classification, and more particularly the TWS, we make an empirical comparison of these methods applied to the multi-label text classification task. We find that the superiority of supervised methods over unsupervised methods is still not clear. We then show that these methods are not fully adapted to the multi-label classification problem and that they ignore much statistical information that could be used to improve classification results. We therefore propose a new TWS based on information gain. This new method takes the term distribution into account not only with respect to the positive and negative categories but also in relation to all other classes. Finally, aiming to find specialized TWS that also address the issue of imbalanced tasks, we study the benefits of using genetic programming to generate TWS for the text classification task. Unlike previous studies, we generate formulas by combining statistical information at a microscopic level (e.g., the number of documents that contain a specific term) instead of using complete TWS, and we also make use of categorical information (e.g., the number of categories in which a term occurs). Experiments were carried out to measure the impact of these methods on the performance of the model, and the results are positive.
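The thesis proposes a supervised term weighting scheme based on information gain. The snippet below sketches that general idea on an invented toy corpus: each term's frequency is scaled by the information gain of the term with respect to the class labels. The exact formula and corpus are illustrative, not the scheme proposed in the thesis.

```python
# Hedged sketch of a supervised term weighting scheme: term frequency scaled by the
# term's information gain over the class labels. Toy corpus and formula are
# illustrative only.
import math
from collections import Counter

docs = [
    ("oil prices rise in the middle east", "politics"),
    ("parliament debates the new oil policy", "politics"),
    ("the band released a classic jazz album", "music"),
    ("a pop album tops the charts this week", "music"),
]

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

labels = [c for _, c in docs]
h_class = entropy(list(Counter(labels).values()))
vocab = {w for text, _ in docs for w in text.split()}

def information_gain(term):
    with_t = [c for text, c in docs if term in text.split()]
    without_t = [c for text, c in docs if term not in text.split()]
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:
            h_cond += len(part) / len(docs) * entropy(list(Counter(part).values()))
    return h_class - h_cond

ig = {t: information_gain(t) for t in vocab}

def weight(doc_text):
    tf = Counter(doc_text.split())
    return {t: tf[t] * ig.get(t, 0.0) for t in tf}   # tf * information gain

print(weight("oil policy debates continue"))
```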
557

Auxílio na prevenção de doenças crônicas por meio de mapeamento e relacionamento conceitual de informações em biomedicina / Support in the Prevention of Chronic Diseases by means of Mapping and Conceptual Relationship of Biomedical Information

Pollettini, Juliana Tarossi 28 November 2011
Recent research in genomic medicine suggests that exposure to risk factors from conception through adolescence may influence gene expression and, consequently, induce the development of chronic diseases in adulthood. Scientific papers reporting these discoveries indicate that epigenetics should be exploited to prevent highly prevalent diseases such as cardiovascular disease, diabetes, and obesity. The large number of papers published every day makes it difficult for professionals to stay up to date, since searching for precise information becomes complex and time-consuming. Computational techniques can support the management of large biomedical information repositories and the discovery of knowledge. This work investigates the automatic retrieval of scientific papers that relate chronic diseases to risk factors detected in a patient's clinical record, and presents a software framework for surveillance systems that alert health professionals to problems in human development. With such alerts, healthcare professionals can work with the family to establish the best conditions for a child's development. The effective transformation of biomedical research results into knowledge that actually benefits public health has been considered an important domain of informatics, called Translational Bioinformatics (Butte, 2008). Since chronic diseases are a serious health problem worldwide and lead the causes of mortality with 60% of all deaths, this work may enable research results to directly benefit public health and can be regarded as a Translational Bioinformatics effort.
558

科技政策網站內容分析之研究 / A study on content analysis of science and technology policy websites

賴昌彥, Lai, Chang-Yen Unknown Date
With the rapid growth of World Wide Web (WWW) applications, the Internet is flooded with information resources of every kind, and how to manage and retrieve these data effectively has become one of the key issues in information management. When seeking information, the most commonly used tool is the search engine, which matches a query string against an index table to find relevant web documents and return results. However, because the descriptive information of web pages is insufficient, search engines return a large number of irrelevant results and waste much of the user's time. To address this problem, from the information search perspective this study proposes an architecture that applies text mining techniques to analyze the actual content of web pages, converts that content into dimensional information, and stores it in a multidimensional database, as a reference architecture for improving current information retrieval. From the information description perspective, this study proposes using RDF (Resource Description Framework) to describe web page metadata. Describing web resources with this common data format provides a cross-domain standard for expressing information and facilitates communication between Web applications, with the aim of remedying the shortcomings of current Internet resource description and substantially improving search quality.
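The abstract proposes describing web page metadata with RDF. As a small illustration of what such a description can look like, the sketch below builds and serializes a few RDF triples, assuming the rdflib package; the URL, the namespace, and the property values are invented placeholders.

```python
# Hedged sketch of describing a web page's metadata with RDF, assuming the rdflib
# package; the URL, namespace, and property values are invented placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/terms/")
page = URIRef("http://example.org/tech-policy/report-2001")

g = Graph()
g.bind("dc", DC)
g.bind("ex", EX)
g.add((page, RDF.type, EX.PolicyDocument))
g.add((page, DC.title, Literal("National S&T policy white paper")))
g.add((page, DC.language, Literal("zh")))
g.add((page, EX.topic, Literal("science and technology policy")))

# Serialize the metadata as Turtle, a common RDF text format.
print(g.serialize(format="turtle"))
```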
559

探索美國財務報表的主觀性詞彙與盈餘的關聯性:意見分析之應用 / Exploring the relationships between annual earnings and subjective expressions in US financial statements: opinion analysis applications

陳建良, Chen, Chien Liang Unknown Date
Subjective assertions in financial statements influence the judgments of market participants when they assess the value and profitability of the reporting corporations. The management of a corporation may therefore be highly motivated to conceal the negative and accentuate the positive with "prudent" wording. Because manually mining useful information from such large volumes of financial-statement text is impractical, we designed an artificial-intelligence-based strategy to investigate the link between financial status, measured by annual earnings, and subjective multi-word expressions (MWEs). We applied conditional random field (CRF) models to identify opinion patterns in the form of MWEs, and our approach outperformed previous work employing unigram models. Moreover, our algorithms uncover evidence supporting the common belief that there are inconsistencies between the implications of the written statements and the reality indicated by the figures in the financial statements: unexpected negative earnings are often accompanied by ambiguous and mild statements, and sometimes by promises of a glorious future.
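The study identifies subjective multi-word expressions with conditional random field models. The sketch below shows the general shape of such a tagger, assuming the sklearn-crfsuite package and a BIO labeling of opinion spans; the two training sentences, the feature set, and the labels are invented and do not reproduce the thesis's features or data.

```python
# Sketch of a CRF tagger for subjective multi-word expressions using BIO labels.
# Assumes the sklearn-crfsuite package; sentences, features and labels are invented.
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

sentences = [
    "We remain confident about future growth".split(),
    "Revenue declined due to challenging market conditions".split(),
]
# B-/I-SUBJ mark tokens inside a subjective MWE, O marks everything else.
tags = [
    ["O", "B-SUBJ", "I-SUBJ", "O", "O", "O"],
    ["O", "O", "O", "O", "B-SUBJ", "I-SUBJ", "I-SUBJ"],
]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, tags)

test = "Management expects solid performance".split()
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```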
560

應用文字探勘技術於英文文章難易度分類 / The Classification of the Difficulty of English Articles with Text Mining

許珀豪, Hsu, Po Hao Unknown Date
With the abundance of English articles available online, helping learners select articles whose difficulty matches their own reading ability is a topic worth exploring. To raise the accuracy of difficulty classification, recent studies have drawn on many difficulty-related features. This study classifies articles using linguistic difficulty features, text features, and the combination of the two, and compares each result against the official level labels to examine whether combining the two kinds of features improves accuracy. The classification is based on articles from official GEPT practice tests. The framework has three parts: classification with linguistic difficulty features, classification with text features, and classification with both combined. First, the linguistic difficulty features form the dimensions of a feature vector; after the feature values are computed, kNN classifies each article as elementary, intermediate, or high-intermediate, which serves as the baseline for comparing accuracy. Second, the GEPT articles are segmented into words, selected feature words form the dimensions of the text feature vector, and TF-IDF values serve as the feature values for text-feature classification. Finally, the two kinds of features are combined for classification. The F-measures are 0.61 for the linguistic features and 0.47 for the text features, and the best result, an F-measure of 0.68, is obtained when the two are combined. The study hopes these results can help learners find English articles appropriate to their level among the vast number available and learn step by step. In the future, articles could be classified by combining linguistic difficulty features and text features into different levels and different categories, so that learners can choose what they like at a level that matches their ability, improving learning outcomes.
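The study combines linguistic difficulty features with TF-IDF text features and classifies articles with kNN. The sketch below illustrates that combination on invented toy passages with two of the many possible linguistic features (average sentence length and average word length); the real study used GEPT practice-test articles and three difficulty levels.

```python
# Sketch of combining linguistic difficulty features with TF-IDF text features and
# classifying with kNN. The toy passages and level labels are invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

articles = [
    "The cat sat on the mat. It was warm.",
    "Tom likes to play ball with his friends after school.",
    "The committee postponed the decision pending further economic analysis.",
    "Rapid globalization has fundamentally reshaped contemporary labor markets.",
]
levels = ["elementary", "elementary", "high-intermediate", "high-intermediate"]

def linguistic_features(text):
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return [
        len(words) / max(len(sentences), 1),               # average sentence length
        sum(len(w) for w in words) / max(len(words), 1),   # average word length
    ]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(articles).toarray()
ling = np.array([linguistic_features(a) for a in articles])
X = np.hstack([ling, tfidf])                               # combined feature vectors

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, levels)

new = "The government announced comprehensive reforms to stimulate private investment."
x_new = np.hstack([linguistic_features(new), vectorizer.transform([new]).toarray()[0]])
print(knn.predict([x_new]))
```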
