Global ETD Search

191	[pt] EXTRAÇÃO DE INFORMAÇÕES DE SENTENÇAS JUDICIAIS EM PORTUGUÊS / [en] INFORMATION EXTRACTION FROM LEGAL OPINIONS IN BRAZILIAN PORTUGUESE GUSTAVO MARTINS CAMPOS COELHO 03 October 2022 [pt] A Extração de Informação é uma tarefa importante no domínio jurídico. Embora a presença de dados estruturados seja escassa, dados não estruturados na forma de documentos jurídicos, como sentenças, estão amplamente disponíveis. Se processados adequadamente, tais documentos podem fornecer informações valiosas sobre processos judiciais anteriores, permitindo uma melhor avaliação por profissionais do direito e apoiando aplicativos baseados em dados. Este estudo aborda a Extração de Informação no domínio jurídico, extraindo valor de sentenças relacionadas a reclamações de consumidores. Mais especificamente, a extração de cláusulas categóricas é abordada através de classificação, onde seis modelos baseados em diferentes estruturas são analisados. Complementarmente, a extração de valores monetários relacionados a indenizações por danos morais é abordada por um modelo de Reconhecimento de Entidade Nomeada. Para avaliação, um conjunto de dados foi criado, contendo 964 sentenças anotadas manualmente (escritas em português) emitidas por juízes de primeira instância. Os resultados mostram uma média de aproximadamente 97 por cento de acurácia na extração de cláusulas categóricas, e 98,9 por cento na aplicação de NER para a extração de indenizações por danos morais. / [en] Information Extraction is an important task in the legal domain. While the presence of structured and machine-processable data is scarce, unstructured data in the form of legal documents, such as legal opinions, is largely available. If properly processed, such documents can provide valuable information with regard to past lawsuits, allowing better assessment by legal professionals and supporting data-driven applications. 
This study addresses Information Extraction in the legal domain by extracting value from legal opinions related to consumer complaints. More specifically, the extraction of categorical provisions is addressed by classification, where six models based on different frameworks are analyzed. Moreover, the extraction of monetary values related to moral damage compensations is addressed by a Named Entity Recognition (NER) model. For evaluation, a dataset was constructed, containing 964 manually annotated legal opinions (written in Brazilian Portuguese) issued by lower court judges. The results show an average accuracy of approximately 97 percent when extracting categorical provisions, and 98.9 percent when applying NER for the extraction of moral damage compensations. [pt] EXTRACAO DE INFORMACAO [pt] EXTRACAO DE VARIAVEIS EM TEXTOS [pt] PROCESSAMENTO DE LINGUAGEM NATURAL [pt] CLASSIFICACAO DE TEXTOS [en] EXTRACTION OF INFORMATION [en] TEXT FEATURE EXTRACTION [en] NAMED ENTITY RECOGNITION [en] NATURAL LANGUAGE PROCESSING [en] TEXT CLASSIFICATION
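To make the monetary-value extraction task concrete, here is a minimal rule-based sketch in Python. It is not the NER model the thesis evaluates: a regular expression is swapped in as a simple stand-in, and the pattern, the function name `extract_brl_amounts`, and the sample sentence in the usage note are all assumptions made for illustration.

```python
import re
from decimal import Decimal

# Assumed pattern for Brazilian-format amounts such as "R$ 5.000,00" or "R$ 800".
# Group 1 is the integer part (with optional thousands dots), group 2 the cents.
BRL_PATTERN = re.compile(r"R\$\s?(\d{1,3}(?:\.\d{3})+|\d+)(?:,(\d{2}))?")


def extract_brl_amounts(text: str) -> list[Decimal]:
    """Return every amount written in Brazilian reais found in the text."""
    amounts = []
    for match in BRL_PATTERN.finditer(text):
        integer_part = match.group(1).replace(".", "")  # drop thousands separators
        cents = match.group(2) or "00"                  # default to zero cents
        amounts.append(Decimal(f"{integer_part}.{cents}"))
    return amounts
```

Calling `extract_brl_amounts("Condeno a ré ao pagamento de R$ 5.000,00 a título de danos morais.")` on this made-up sentence yields a single `Decimal` amount. A pattern like this cannot tell a moral-damage award from any other amount in the opinion, which is one reason a trained NER model is the more plausible tool for the task described in the abstract.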
192	The Struggle Against Misinformation: Evaluating the Performance of Basic vs. Complex Machine Learning Models on Manipulated Data Valladares Parker, Diego Gabriel January 2024 This study investigates the application of machine learning (ML) techniques in detecting fake news, addressing the rapid spread of misinformation across social media platforms. Given the time-consuming nature of manual fact-checking, this research compares the robustness of basic machine learning models, such as Multinomial Naive Bayes classifiers, with complex models like Distil-BERT in identifying fake news. Utilizing datasets including LIAR, ISOT, and GM, this study evaluates these models on standard classification metrics in both single-domain and cross-domain scenarios, especially when processing linguistically manipulated data. Results indicate that while complex models like Distil-BERT perform better in single-domain classification, the baseline models show competitive performance in cross-domain settings and on the manipulated dataset. However, both kinds of model struggle with the manipulated dataset, highlighting a critical area for improvement in fake news detection algorithms and methods. In conclusion, the findings suggest that while both basic and complex models have their strengths in certain settings, significant advancements are needed to improve robustness against linguistic manipulations and ensure reliable detection of fake news across varied contexts before automated classification is considered for public availability. Fake News Detection Machine Learning Natural Language Processing Information Dissemination Text Classification Lexical Analysis Neural Networks Cross-Domain Validation Algorithmic Bias Misinformation
193	Variações do método kNN e suas aplicações na classificação automática de textos / kNN Method Variations and its applications in Text Classification SANTOS, Fernando Chagas 10 October 2010 Most research on Automatic Text Categorization (ATC) seeks to improve the performance (effectiveness or efficiency) of the classifier responsible for automatically classifying a document d that has not yet been classified. The k nearest neighbors (kNN) method is one of the simplest and most effective automatic classification methods proposed so far. In this work we propose two kNN variations, Inverse kNN (kINN) and Symmetric kNN (kSNN), with the aim of improving the effectiveness of ATC. The kNN, kINN and kSNN methods were applied to the Reuters, 20NG and Ohsumed collections, and the results showed that kINN and kSNN were more effective than kNN on the Reuters and Ohsumed collections and as effective as kNN on the 20NG collection. In addition, the performance achieved by kNN is more stable than that of kINN and kSNN when the value of k changes. A parallel study was conducted to generate new features in documents from the similarity matrices resulting from the selection criteria for the best results obtained with the kNN, kINN and kSNN methods. The SVM method (considered state of the art in terms of effectiveness) was applied to the Reuters, 20NG and Ohsumed collections before and after applying this feature-generation approach, and the results showed statistically significant gains relative to the original collection. / Grande parte das pesquisas relacionadas com a classificação automática de textos (CAT) tem procurado melhorar o desempenho (eficácia ou eficiência) do classificador responsável por classificar automaticamente um documento d, ainda não classificado. 
O método dos k vizinhos mais próximos (kNN, do inglês k nearest neighbors) é um dos métodos de classificação automática mais simples e eficazes já propostos. Neste trabalho foram propostas duas variações do método kNN, o kNN invertido (kINN) e o kNN simétrico (kSNN) com o objetivo de melhorar a eficácia da CAT. Os métodos kNN, kINN e kSNN foram aplicados nas coleções Reuters, 20NG e Ohsumed e os resultados obtidos demonstraram que os métodos kINN e kSNN tiveram eficácia superior ao método kNN ao serem aplicados nas coleções Reuters e Ohsumed e eficácia equivalente ao método kNN ao serem aplicados na coleção 20NG. Além disso, nessas coleções foi possível verificar que o desempenho obtido pelo método kNN é mais estável a variação do valor k do que os desempenhos obtidos pelos métodos kINN e kSNN. Um estudo paralelo foi realizado para gerar novas características em documentos a partir das matrizes de similaridade resultantes dos critérios de seleção dos melhores resultados obtidos na avaliação dos métodos kNN, kINN e kSNN. O método SVM, considerado um método de classificação do estado da arte em relação à eficácia, foi aplicado nas coleções Reuters, 20NG e Ohsumed - antes e após aplicar a abordagem de geração de características nesses documentos e os resultados obtidos demonstraram ganhos estatisticamente significativos em relação à coleção original. Classificação de Textos Aprendizagem de Máquina Método kNN Critérios de Seleção Geração de Características Geração de Termos Text Classification Machine Learning kNN Method Feature Selection Feature Construction
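For readers unfamiliar with the baseline, the standard kNN text classifier that this record builds on can be sketched in a few lines of Python. This is only the plain kNN method: the abstract does not define kINN or kSNN, so they are not attempted here. The whitespace tokenizer, cosine similarity on raw term counts, and the tiny `TRAINING` set are assumptions chosen for illustration, not details from the dissertation.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts using a naive whitespace tokenizer."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(document: str, training: list[tuple[str, str]], k: int = 3) -> str:
    """Label a document by majority vote among its k most similar training documents."""
    query = vectorize(document)
    ranked = sorted(
        training,
        key=lambda pair: cosine(query, vectorize(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy collection, standing in for Reuters, 20NG or Ohsumed.
TRAINING = [
    ("stocks fell as markets closed lower", "finance"),
    ("bank shares and bond markets rallied", "finance"),
    ("the team won the match in extra time", "sports"),
    ("the striker scored twice in the final match", "sports"),
]
```

With this toy data, `knn_classify("markets and bank stocks rallied", TRAINING, k=3)` returns `"finance"`, because the two finance documents share terms with the query and outvote the single sports neighbor. The sensitivity to k mentioned in the abstract can be explored by varying the `k` argument.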
194	An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition Tsatsaronis, George 10 October 2017 This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. BioASQ-Wettbewerb hierarchische Textsystematik semantische Indizierung Informationsbeschaffung Passage Retrieval Fragebeantwortung Mehrfachdokumentation Technische Universität Dresden Publikationsfonds BioASQ Competition Hierarchical Text Classification Semantic indexing Information retrieval Passage retrieval Question answering Multi-document text summarization Technische Universität Dresden Publishing Fund ddc:570 rvk:WA 15000
195	Classification automatique de textes pour les revues de littérature mixtes en santé Langlois, Alexis 12 1900 Les revues de littérature sont couramment employées en sciences de la santé pour justifier et interpréter les résultats d’un ensemble d’études. Elles permettent également aux chercheurs, praticiens et décideurs de demeurer à jour sur les connaissances. Les revues dites systématiques mixtes produisent un bilan des meilleures études portant sur un même sujet tout en considérant l’ensemble des méthodes de recherche quantitatives et qualitatives. Leur production est ralentie par la prolifération des publications dans les bases de données bibliographiques et la présence accentuée de travaux non scientifiques comme les éditoriaux et les textes d’opinion. Notamment, l’étape d’identification des études pertinentes pour l’élaboration de telles revues s’avère laborieuse et requiert un temps considérable. Traditionnellement, le triage s’effectue en utilisant un ensemble de règles établies manuellement. Dans cette étude, nous explorons la possibilité d’utiliser la classification automatique pour exécuter cette tâche. La famille d’algorithmes ayant été considérée dans le comparatif de ce travail regroupe les arbres de décision, la classification naïve bayésienne, la méthode des k plus proches voisins, les machines à vecteurs de support ainsi que les approches par votes. Différentes méthodes de combinaison de caractéristiques exploitant les termes numériques, les symboles ainsi que les synonymes ont été comparés. La pertinence des concepts issus d’un méta-thésaurus a également été mesurée. En exploitant les résumés et les titres d’approximativement 10 000 références, les forêts d’arbres de décision admettent le plus haut taux de succès (88.76%), suivies par les machines à vecteurs de support (86.94%). L’efficacité de ces approches devance la performance des filtres booléens conçus pour les bases de données bibliographiques. 
Toutefois, une sélection judicieuse des entrées de la collection d’entraînement est cruciale pour pallier l’instabilité du modèle final et la disparité des méthodologies quantitatives et qualitatives des études scientifiques existantes. / The interest of health researchers and policy-makers in literature reviews has continued to increase over the years. Mixed studies reviews are highly valued since they combine results from the best available studies on various topics while considering quantitative, qualitative and mixed research methods. These reviews can be used for several purposes such as justifying, designing and interpreting results of primary studies. Due to the proliferation of published papers and the growing number of nonempirical works such as editorials and opinion letters, screening records for mixed studies reviews is time-consuming. Traditionally, reviewers are required to manually identify potentially relevant studies. To facilitate this process, different automated text classification methods were compared in order to determine the most effective and robust approach for systematic mixed studies reviews. The group of algorithms considered in this study combined decision trees, naive Bayes classifiers, k-nearest neighbours, support vector machines and voting approaches. Statistical techniques were applied to assess the relevancy of multiple features according to a predefined dataset. The benefits of feature combination for numerical terms, synonyms and mathematical symbols were also measured. Furthermore, concepts extracted from a metathesaurus were used as additional features in order to improve the training process. Using the titles and abstracts of approximately 10,000 entries, decision trees perform the best with an accuracy of 88.76%, followed by support vector machines (86.94%). The final model based on decision trees relies on linear interpolation and a group of concepts extracted from a metathesaurus. 
This approach outperforms the mixed filters commonly used with bibliographic databases like MEDLINE. However, references chosen for training must be selected judiciously in order to address the model instability and the disparity of quantitative and qualitative study designs. Classification automatique Revue de littérature Étude mixte Méthode de recherche Santé Arbre de décision Machine à vecteurs de support Automated text classification Systematic review Mixed study Research method Health care Decision tree Support vector machine
196	An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition Tsatsaronis, George 10 October 2017 This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. info:eu-repo/classification/ddc/570 ddc:570
197	Inteligentní emailová schránka / Intelligent Mailbox Pohlídal, Antonín January 2012 This master's thesis deals with the use of text classification for sorting incoming emails. It first describes Knowledge Discovery in Databases and analyzes text classification with selected methods in detail. It then describes email communication and the SMTP, POP3 and IMAP protocols. The next part presents the design of a system that classifies incoming emails and describes the related technologies, i.e. Apache James Server, PostgreSQL and RapidMiner. The implementation of all necessary components is described next. The last part presents experiments with the email server using the Enron Dataset.
198	All Negative on the Western Front: Analyzing the Sentiment of the Russian News Coverage of Sweden with Generic and Domain-Specific Multinomial Naive Bayes and Support Vector Machines Classifiers / På västfronten intet gott: attitydanalys av den ryska nyhetsrapporteringen om Sverige med generiska och domänspecifika Multinomial Naive Bayes- och Support Vector Machines-klassificerare Michel, David January 2021 This thesis explores to what extent Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM) classifiers can be used to determine the polarity of news, specifically the news coverage of Sweden by the Russian state-funded news outlets RT and Sputnik. Three experiments are conducted. In the first experiment, an MNB and an SVM classifier are trained with the Large Movie Review Dataset (Maas et al., 2011) with a varying number of samples to determine how training data size affects classifier performance. In the second experiment, the classifiers are trained with 300 positive, negative, and neutral news articles (Agarwal et al., 2019) and tested on 95 RT and Sputnik news articles about Sweden (Bengtsson, 2019) to determine if the domain specificity of the training data outweighs its limited size. In the third experiment, the movie-trained classifiers are put up against the domain-specific classifiers to determine if well-trained classifiers from another domain perform better than relatively untrained, domain-specific classifiers. Four different types of feature sets (unigrams, unigrams without stop-word removal, bigrams, trigrams) were used in the experiments. Some of the model parameters (TF-IDF vs. feature count and SVM’s C parameter) were optimized with 10-fold cross-validation. Besides the superior performance of SVM, the results highlight the need for comprehensive and domain-specific training data when conducting machine learning tasks, as well as the benefits of feature engineering and, to a limited extent, the removal of stop words. 
Interestingly, the classifiers performed the best on the negative news articles, which made up most of the test set (and possibly of Russian news coverage of Sweden in general). sentiment analysis news sentiment text classification cross-domain sentiment classification domain specificity domain-transfer problem transfer learning knowledge transfer support vector machines SVM multinomial naive Bayes Sweden Aurora 17 Russia Russian news RT Sputnik cyberwarfare influence campaign disinformation fake news propaganda
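The combination of n-gram feature sets and Multinomial Naive Bayes described in this record can be illustrated with a small, dependency-free Python sketch. It is an assumption-laden toy, not the thesis setup: the class name `MultinomialNB` here is a hand-written illustration (not the scikit-learn estimator), smoothing is plain add-alpha, features are raw counts rather than TF-IDF, and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict


def ngrams(text: str, n: int) -> list[str]:
    """Word n-grams from a naive whitespace tokenization (n=1 gives unigrams)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class MultinomialNB:
    """Toy multinomial Naive Bayes over n-gram counts with add-alpha smoothing."""

    def __init__(self, n: int = 1, alpha: float = 1.0):
        self.n = n
        self.alpha = alpha

    def fit(self, texts: list[str], labels: list[str]) -> "MultinomialNB":
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text, self.n))
        self.vocabulary = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for feature in ngrams(text, self.n):
                if feature in self.vocabulary:  # ignore unseen features
                    score += math.log((counts[feature] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences, for illustration only.
TEXTS = [
    "relations improved and trade grew",
    "a welcome and positive agreement",
    "tensions rose and talks failed",
    "a hostile and negative response",
]
LABELS = ["positive", "positive", "negative", "negative"]
```

Fitting `MultinomialNB(n=1)` on `TEXTS` and `LABELS` and predicting `"positive trade agreement"` gives `"positive"`, while a bigram model (`n=2`) predicts `"negative"` for `"talks failed"`. Switching `n` is the toy counterpart of comparing the unigram, bigram and trigram feature sets mentioned in the abstract.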
199	Zero/Few-Shot Text Classification : A Study of Practical Aspects and Applications / Textklassificering med Zero/Few-Shot Learning : En Studie om Praktiska Aspekter och Applikationer Åslund, Jacob January 2021 SOTA language models have demonstrated remarkable capabilities in tackling NLP tasks they have not been explicitly trained on – given a few demonstrations of the task (few-shot learning), or even none at all (zero-shot learning). The purpose of this Master’s thesis has been to investigate practical aspects and potential applications of zero/few-shot learning in the context of text classification. This includes topics such as combined usage with active learning, automated data labeling, and interpretability. Two different methods for zero/few-shot learning have been investigated, and the results indicate that: • Active learning can be used to marginally improve few-shot performance, but it seems to be mostly beneficial in settings with very few samples (e.g. fewer than 10). • Zero-shot learning can be used to produce reasonable candidate labels for classes in a dataset, given knowledge of the classification task at hand. • It is difficult to trust the predictions of zero-shot text classification without access to a validation dataset, but interpretable machine learning (IML) methods such as saliency maps could find usage in debugging zero-shot models. / Ledande språkmodeller har uppvisat anmärkningsvärda förmågor i att lösa NLP-problem de inte blivit explicit tränade på – givet några exempel av problemet (few-shot learning), eller till och med inga alls (zero-shot learning). Syftet med det här examensarbetet har varit att undersöka praktiska aspekter och potentiella tillämpningar av zero/few-shot learning inom kontext av textklassificering. Detta inkluderar kombinerad användning med aktiv inlärning, automatiserad datamärkning, och tolkningsbarhet. 
Två olika metoder för zero/few-shot learning har undersökts, och resultaten indikerar att: • Aktiv inlärning kan användas för att marginellt förbättra textklassificering med few-shot learning, men detta verkar vara mest fördelaktigt i situationer med väldigt få datapunkter (t.ex. mindre än 10). • Zero-shot learning kan användas för att hitta lämpliga etiketter för klasser i ett dataset, givet kunskap om klassifikationsuppgiften av intresse. • Det är svårt att lita på robustheten i textklassificering med zero-shot learning utan tillgång till valideringsdata, men metoder inom tolkningsbar maskininlärning såsom saliency maps skulle kunna användas för att felsöka zero-shot modeller. zero-shot learning few-shot learning text classification active learning automated data labeling interpretable machine learning deep learning NLP NLU zero-shot learning few-shot learning textklassificering aktiv inlärning automatiserad datamärkning tolkningsbar maskininlärning djupinlärning NLP NLU Computer and Information Sciences Data- och informationsvetenskap
200	Novel statistical approaches to text classification, machine translation and computer-assisted translation Civera Saiz, Jorge 04 July 2008 This thesis presents several contributions in the fields of automatic text classification, machine translation and computer-assisted translation within the statistical framework. In automatic text classification, a new application called bilingual text classification is proposed, together with a series of models aimed at capturing this bilingual information. To this end, two approaches to this application are presented; the first is based on a naive assumption of independence between the two languages involved, while the second, more sophisticated, considers the existence of a correlation between words in different languages. The first approach led to the development of five models based on unigram models and smoothed n-gram models. These models were evaluated on three tasks of increasing complexity, the most complex of which was analyzed from the viewpoint of a document-indexing aid system. The second approach is characterized by translation models capable of capturing correlation between words in different languages. In our case, the translation model chosen was the M1 model together with a unigram model. This model was evaluated on the two simplest tasks, outperforming the naive approach, which assumes independence between words in different languages drawn from bilingual texts. In machine translation, the word-based statistical translation models M1, M2 and HMM are extended under the mixture modelling framework, with the aim of defining context-dependent translation models. Likewise, an iterative search algorithm based on dynamic programming, originally designed for the M2 model, is extended to the case of mixtures of M2 models. 
This search algorithm … / Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502 Mixture modelling EM algorithm Bilingual text classification Machine-aided indexing Statistical machine translation Stochastic finite-state transducer Computer-assisted translation Word-based translation models N-gram language models LENGUAJES Y SISTEMAS INFORMATICOS 120304 - Artificial intelligence 120317 - Computer science
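One plausible way to write down the naive bilingual approach mentioned in this record is given below. The notation is an assumption, not taken from the thesis: a bilingual document is written as a pair (x, y) of word sequences in the two languages, c ranges over classes, and unigram models are used for each language. The independence assumption is that the two languages contribute separate factors given the class.

```latex
\hat{c}(x, y) = \operatorname*{arg\,max}_{c} \; p(c)\, p(x \mid c)\, p(y \mid c),
\qquad
p(x \mid c) = \prod_{i=1}^{|x|} p(x_i \mid c),
\quad
p(y \mid c) = \prod_{j=1}^{|y|} p(y_j \mid c)
```

Under this reading, the second approach in the abstract would replace the product of the two language factors with a model in which the words of y depend on the words of x, which is the role the M1 translation model plays there.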
