191 |
[pt] EXTRAÇÃO DE INFORMAÇÕES DE SENTENÇAS JUDICIAIS EM PORTUGUÊS / [en] INFORMATION EXTRACTION FROM LEGAL OPINIONS IN BRAZILIAN PORTUGUESEGUSTAVO MARTINS CAMPOS COELHO 03 October 2022 (has links)
[pt] A Extração de Informação é uma tarefa importante no domínio jurídico.
Embora a presença de dados estruturados seja escassa, dados não estruturados na forma de documentos jurídicos, como sentenças, estão amplamente
disponíveis. Se processados adequadamente, tais documentos podem fornecer
informações valiosas sobre processos judiciais anteriores, permitindo uma melhor avaliação por profissionais do direito e apoiando aplicativos baseados em
dados. Este estudo aborda a Extração de Informação no domínio jurídico, extraindo valor de sentenças relacionados a reclamações de consumidores. Mais
especificamente, a extração de cláusulas categóricas é abordada através de
classificação, onde seis modelos baseados em diferentes estruturas são analisados. Complementarmente, a extração de valores monetários relacionados a
indenizações por danos morais é abordada por um modelo de Reconhecimento
de Entidade Nomeada. Para avaliação, um conjunto de dados foi criado, contendo 964 sentenças anotados manualmente (escritas em português) emitidas
por juízes de primeira instância. Os resultados mostram uma média de aproximadamente 97 por cento de acurácia na extração de cláusulas categóricas, e 98,9 por cento
na aplicação de NER para a extração de indenizações por danos morais. / [en] Information Extraction is an important task in the legal domain. While
the presence of structured and machine-processable data is scarce, unstructured data in the form of legal documents, such as legal opinions, is largely
available. If properly processed, such documents can provide valuable information with regards to past lawsuits, allowing better assessment by legal professionals and supporting data-driven applications. This study addresses Information Extraction in the legal domain by extracting value from legal opinions
related to consumer complaints. More specifically, the extraction of categorical
provisions is addressed by classification, where six models based on different
frameworks are analyzed. Moreover, the extraction of monetary values related
to moral damage compensations is addressed by a Named Entity Recognition
(NER) model. For evaluation, a dataset was constructed, containing 964 manually annotated legal opinions (written in Brazilian Portuguese) enacted by
lower court judges. The results show an average of approximately 97 percent of accuracy when extracting categorical provisions, and 98.9 percent when applying NER
for the extraction of moral damage compensations.
|
192 |
The Struggle Against Misinformation: Evaluating the Performance of Basic vs. Complex Machine Learning Models on Manipulated DataValladares Parker, Diego Gabriel January 2024 (has links)
This study investigates the application of machine learning (ML) techniques in detecting fake news, addressing the rapid spread of misinformation across social media platforms. Given the time-consuming nature of manual fact-checking, this research compares the robustness of basic machine learning models, such as Multinominal Naive Bayes classifiers, with complex models like Distil-BERT in identifying fake news. Utilizing datasets including LIAR, ISOT, and GM, this study will evaluate these models based on standard classification metrics both in single domain and cross-domain scenarios, especially when processing linguistically manipulated data. Results indicate that while complex models like Distil-BERT perform better in single-domain classifications, the Baseline models show competitive performance in cross-domain and on the manipulated dataset. However both models struggle with the manipulated dataset, highlighting a critical area for improvement in fake news detection algorithms and methods. In conclusion, the findings suggest that while both basic and complex models have their strength in certain settings, significant advancements are needed to improve against linguistic manipulations, ensuring reliable detection of fake news across varied contexts before consideration of public availability of automated classification.
|
193 |
Variações do método kNN e suas aplicações na classificação automática de textos / kNN Method Variations and its applications in Text ClassificationSANTOS, Fernando Chagas 10 October 2010 (has links)
Made available in DSpace on 2014-07-29T14:57:46Z (GMT). No. of bitstreams: 1
dissertacao-fernando.pdf: 677510 bytes, checksum: 19704f0b04ee313a63b053f7f9df409c (MD5)
Previous issue date: 2010-10-10 / Most research on Automatic Text Categorization (ATC) seeks to improve the classifier
performance (effective or efficient) responsible for automatically classifying a document
d not yet rated. The k nearest neighbors (kNN) is simpler and it s one of automatic
classification methods more effective as proposed. In this paper we proposed two kNN
variations, Inverse kNN (kINN) and Symmetric kNN (kSNN) with the aim of improving
the effectiveness of ACT. The kNN, kINN and kSNN methods were applied in Reuters,
20ng and Ohsumed collections and the results showed that kINN and kSNN methods
were more effective than kNN method in Reuters and Ohsumed collections. kINN and
kSNN methods were as effective as kNN method in 20NG collection. In addition, the
performance achieved by kNN method is more stable than kINN and kSNN methods
when the value k change. A parallel study was conducted to generate new features in
documents from the similarity matrices resulting from the selection criteria for the best
results obtained in kNN, kINN and kSNN methods. The SVM (considered a state of the
art method) was applied in Reuters, 20NG and Ohsumed collections - before and after
applying this approach to generate features in these documents and the results showed
statistically significant gains for the original collection. / Grande parte das pesquisas relacionadas com a classificação automática de textos (CAT)
tem procurado melhorar o desempenho (eficácia ou eficiência) do classificador responsável
por classificar automaticamente um documento d, ainda não classificado. O método
dos k vizinhos mais próximos (kNN, do inglês k nearest neighbors) é um dos métodos
de classificação automática mais simples e eficazes já propostos. Neste trabalho foram
propostas duas variações do método kNN, o kNN invertido (kINN) e o kNN simétrico
(kSNN) com o objetivo de melhorar a eficácia da CAT. Os métodos kNN, kINN e kSNN
foram aplicados nas coleções Reuters, 20NG e Ohsumed e os resultados obtidos demonstraram
que os métodos kINN e kSNN tiveram eficácia superior ao método kNN ao serem
aplicados nas coleções Reuters e Ohsumed e eficácia equivalente ao método kNN ao serem
aplicados na coleção 20NG. Além disso, nessas coleções foi possível verificar que o
desempenho obtido pelo método kNN é mais estável a variação do valor k do que os desempenhos
obtidos pelos métodos kINN e kSNN. Um estudo paralelo foi realizado para
gerar novas características em documentos a partir das matrizes de similaridade resultantes
dos critérios de seleção dos melhores resultados obtidos na avaliação dos métodos
kNN, kINN e kSNN. O método SVM, considerado um método de classificação do estado
da arte em relação à eficácia, foi aplicado nas coleções Reuters, 20NG e Ohsumed - antes
e após aplicar a abordagem de geração de características nesses documentos e os resultados
obtidos demonstraram ganhos estatisticamente significativos em relação à coleção
original.
|
194 |
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competitionTsatsaronis, George 10 October 2017 (has links) (PDF)
This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.
|
195 |
Classification automatique de textes pour les revues de littérature mixtes en santéLanglois, Alexis 12 1900 (has links)
Les revues de littérature sont couramment employées en sciences de la santé pour justifier et interpréter les résultats d’un ensemble d’études. Elles permettent également aux chercheurs, praticiens et décideurs de demeurer à jour sur les connaissances. Les revues dites systématiques mixtes produisent un bilan des meilleures études portant sur un même sujet tout en considérant l’ensemble des méthodes de recherche quantitatives et qualitatives. Leur production est ralentie par la prolifération des publications dans les bases de données bibliographiques et la présence accentuée de travaux non scientifiques comme les éditoriaux et les textes d’opinion. Notamment, l’étape d’identification des études pertinentes pour l’élaboration de telles revues s’avère laborieuse et requiert un temps considérable. Traditionnellement, le triage s’effectue en utilisant un ensemble de règles établies manuellement. Dans cette étude, nous explorons la possibilité d’utiliser la classification automatique pour exécuter cette tâche.
La famille d’algorithmes ayant été considérée dans le comparatif de ce travail regroupe les arbres de décision, la classification naïve bayésienne, la méthode des k plus proches voisins, les machines à vecteurs de support ainsi que les approches par votes. Différentes méthodes de combinaison de caractéristiques exploitant les termes numériques, les symboles ainsi que les synonymes ont été comparés. La pertinence des concepts issus d’un méta-thésaurus a également été mesurée.
En exploitant les résumés et les titres d’approximativement 10 000 références, les forêts d’arbres de décision admettent le plus haut taux de succès (88.76%), suivies par les machines à vecteurs de support (86.94%). L’efficacité de ces approches devance la performance des filtres booléens conçus pour les bases de données bibliographiques. Toutefois, une sélection judicieuse des entrées de la collection d’entraînement est cruciale pour pallier l’instabilité du modèle final et la disparité des méthodologies quantitatives et qualitatives des études scientifiques existantes. / The interest of health researchers and policy-makers in literature reviews has continued to increase over the years. Mixed studies reviews are highly valued since they combine results from the best available studies on various topics while considering quantitative, qualitative and mixed research methods. These reviews can be used for several purposes such as justifying, designing and interpreting results of primary studies. Due to the proliferation of published papers and the growing number of nonempirical works such as editorials and opinion letters, screening records for mixed studies reviews is time consuming. Traditionally, reviewers are required to manually identify potential relevant studies. In order to facilitate this process, a comparison of different automated text classification methods was conducted in order to determine the most effective and robust approach to facilitate systematic mixed studies reviews.
The group of algorithms considered in this study combined decision trees, naive Bayes classifiers, k-nearest neighbours, support vector machines and voting approaches. Statistical techniques were applied to assess the relevancy of multiple features according to a predefined dataset. The benefits of feature combination for numerical terms, synonyms and mathematical symbols were also measured. Furthermore, concepts extracted from a metathesaurus were used as additional features in order to improve the training process.
Using the titles and abstracts of approximately 10,000 entries, decision trees perform the best with an accuracy of 88.76%, followed by support vector machine (86.94%). The final model based on decision trees relies on linear interpolation and a group of concepts extracted from a metathesaurus. This approach outperforms the mixed filters commonly used with bibliographic databases like MEDLINE. However, references chosen for training must be selected judiciously in order to address the model instability and the disparity of quantitative and qualitative study designs.
|
196 |
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competitionTsatsaronis, George 10 October 2017 (has links)
This article provides an overview of the first BioASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BioASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.
|
197 |
Inteligentní emailová schránka / Intelligent MailboxPohlídal, Antonín January 2012 (has links)
This master's thesis deals with the use of text classification for sorting of incoming emails. First, there is described the Knowledge Discovery in Databases and there is also analyzed in detail the text classification with selected methods. Further, this thesis describes the email communication and SMTP, POP3 and IMAP protocols. The next part contains design of the system that classifies incoming emails and there are also described realated technologie ie Apache James Server, PostgreSQL and RapidMiner. Further, there is described the implementation of all necessary components. The last part contains an experiments with email server using Enron Dataset.
|
198 |
All Negative on the Western Front: Analyzing the Sentiment of the Russian News Coverage of Sweden with Generic and Domain-Specific Multinomial Naive Bayes and Support Vector Machines Classifiers / På västfronten intet gott: attitydanalys av den ryska nyhetsrapporteringen om Sverige med generiska och domänspecifika Multinomial Naive Bayes- och Support Vector Machines-klassificerareMichel, David January 2021 (has links)
This thesis explores to what extent Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM) classifiers can be used to determine the polarity of news, specifically the news coverage of Sweden by the Russian state-funded news outlets RT and Sputnik. Three experiments are conducted. In the first experiment, an MNB and an SVM classifier are trained with the Large Movie Review Dataset (Maas et al., 2011) with a varying number of samples to determine how training data size affects classifier performance. In the second experiment, the classifiers are trained with 300 positive, negative, and neutral news articles (Agarwal et al., 2019) and tested on 95 RT and Sputnik news articles about Sweden (Bengtsson, 2019) to determine if the domain specificity of the training data outweighs its limited size. In the third experiment, the movie-trained classifiers are put up against the domain-specific classifiers to determine if well-trained classifiers from another domain perform better than relatively untrained, domain-specific classifiers. Four different types of feature sets (unigrams, unigrams without stop words removal, bigrams, trigrams) were used in the experiments. Some of the model parameters (TF-IDF vs. feature count and SVM’s C parameter) were optimized with 10-fold cross-validation. Other than the superior performance of SVM, the results highlight the need for comprehensive and domain-specific training data when conducting machine learning tasks, as well as the benefits of feature engineering, and to a limited extent, the removal of stop words. Interestingly, the classifiers performed the best on the negative news articles, which made up most of the test set (and possibly of Russian news coverage of Sweden in general).
|
199 |
Zero/Few-Shot Text Classification : A Study of Practical Aspects and Applications / Textklassificering med Zero/Few-Shot Learning : En Studie om Praktiska Aspekter och ApplikationerÅslund, Jacob January 2021 (has links)
SOTA language models have demonstrated remarkable capabilities in tackling NLP tasks they have not been explicitly trained on – given a few demonstrations of the task (few-shot learning), or even none at all (zero-shot learning). The purpose of this Master’s thesis has been to investigate practical aspects and potential applications of zero/few-shot learning in the context of text classification. This includes topics such as combined usage with active learning, automated data labeling, and interpretability. Two different methods for zero/few-shot learning have been investigated, and the results indicate that: • Active learning can be used to marginally improve few-shot performance, but it seems to be mostly beneficial in settings with very few samples (e.g. less than 10). • Zero-shot learning can be used produce reasonable candidate labels for classes in a dataset, given knowledge of the classification task at hand. • It is difficult to trust the predictions of zero-shot text classification without access to a validation dataset, but IML methods such as saliency maps could find usage in debugging zero-shot models. / Ledande språkmodeller har uppvisat anmärkningsvärda förmågor i att lösa NLP-problem de inte blivit explicit tränade på – givet några exempel av problemet (few-shot learning), eller till och med inga alls (zero-shot learning). Syftet med det här examensarbetet har varit att undersöka praktiska aspekter och potentiella tillämpningar av zero/few-shot learning inom kontext av textklassificering. Detta inkluderar kombinerad användning med aktiv inlärning, automatiserad datamärkning, och tolkningsbarhet. Två olika metoder för zero/few-shot learning har undersökts, och resultaten indikerar att: • Aktiv inlärning kan användas för att marginellt förbättra textklassificering med few-shot learning, men detta verkar vara mest fördelaktigt i situationer med väldigt få datapunkter (t.ex. mindre än 10). • Zero-shot learning kan användas för att hitta lämpliga etiketter för klasser i ett dataset, givet kunskap om klassifikationsuppgiften av intresse. • Det är svårt att lita på robustheten i textklassificering med zero-shot learning utan tillgång till valideringsdata, men metoder inom tolkningsbar maskininlärning såsom saliency maps skulle kunna användas för att felsöka zero-shot modeller.
|
200 |
Novel statistical approaches to text classification, machine translation and computer-assisted translationCivera Saiz, Jorge 04 July 2008 (has links)
Esta tesis presenta diversas contribuciones en los campos de la
clasificación automática de texto, traducción automática y traducción
asistida por ordenador bajo el marco estadístico.
En clasificación automática de texto, se propone una nueva aplicación
llamada clasificación de texto bilingüe junto con una serie de modelos
orientados a capturar dicha información bilingüe. Con tal fin se
presentan dos aproximaciones a esta aplicación; la primera de ellas se
basa en una asunción naive que contempla la independencia entre las
dos lenguas involucradas, mientras que la segunda, más sofisticada,
considera la existencia de una correlación entre palabras en
diferentes lenguas. La primera aproximación dió lugar al desarrollo de
cinco modelos basados en modelos de unigrama y modelos de n-gramas
suavizados. Estos modelos fueron evaluados en tres tareas de
complejidad creciente, siendo la más compleja de estas tareas
analizada desde el punto de vista de un sistema de ayuda a la
indexación de documentos. La segunda aproximación se caracteriza por
modelos de traducción capaces de capturar correlación entre palabras
en diferentes lenguas. En nuestro caso, el modelo de traducción
elegido fue el modelo M1 junto con un modelo de unigramas. Este
modelo fue evaluado en dos de las tareas más simples superando la
aproximación naive, que asume la independencia entre palabras en
differentes lenguas procedentes de textos bilingües.
En traducción automática, los modelos estadísticos de traducción
basados en palabras M1, M2 y HMM son extendidos bajo el marco de la
modelización mediante mixturas, con el objetivo de definir modelos de
traducción dependientes del contexto. Asimismo se extiende un
algoritmo iterativo de búsqueda basado en programación dinámica,
originalmente diseñado para el modelo M2, para el caso de mixturas de
modelos M2. Este algoritmo de búsqueda n / Civera Saiz, J. (2008). Novel statistical approaches to text classification, machine translation and computer-assisted translation [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2502
|
Page generated in 0.0343 seconds