211
A Hybrid Approach to General Information Extraction. Grap, Marie Belen, 01 September 2015.
Information Extraction (IE) is the process of analyzing documents and identifying desired pieces of information within them. Many IE systems have been developed over the last couple of decades, but there is still room for improvement, as IE remains an open problem for researchers. This work discusses the development of a hybrid IE system that attempts to combine the strengths of rule-based and statistical IE systems while avoiding their respective pitfalls, in order to achieve high performance for any type of information on any type of document. Test results show that the system operates competitively when the target information belongs to a highly structured data type and when critical contextual information is in close proximity to the target.
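As an illustration of the hybrid idea, a minimal sketch for one highly structured target type (dates) might look like this; the cue words, weights, and threshold are hypothetical, not the thesis's actual system:

```python
import re

# A high-precision rule proposes candidates, and a simple statistical
# score over nearby context words accepts or rejects them. The cue
# words and weights below are illustrative assumptions.
DATE_RULE = re.compile(
    r"\b\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}\b")
CONTEXT_CUES = {"published": 0.8, "submitted": 0.6, "on": 0.2}

def extract_dates(text, threshold=0.5):
    """Rules propose candidates; a context-proximity score filters them."""
    results = []
    for match in DATE_RULE.finditer(text):
        # Score the candidate using cue words in a small window before it.
        window = text[max(0, match.start() - 40):match.start()].lower().split()
        score = sum(CONTEXT_CUES.get(word, 0.0) for word in window)
        if score >= threshold:
            results.append((match.group(), round(score, 2)))
    return results

print(extract_dates("The thesis was published on 1 September 2015."))
# -> [('1 September 2015', 1.0)]
```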
212
CREATE: Clinical Record Analysis Technology Ensemble. Eglowski, Skylar, 01 June 2017.
In this thesis, we describe an approach that won a psychiatric symptom severity prediction challenge. The challenge was to correctly predict the severity of psychiatric symptoms on a 4-point scale. Our winning submission uses a novel stacked machine learning architecture consisting of (i) a base data ingestion and cleaning step, followed by (ii) derivation of a base set of features defined using text analytics, after which (iii) association rule learning was used in a novel way to generate new features, followed by (iv) a feature selection step to eliminate irrelevant features, then (v) a classifier training step in which a total of 22 classifiers, including new variants of AdaBoost and RandomForest, were trained on seven different data views, and finally (vi) an ensemble learning step, in which ensembles of the best learners were used to improve on the accuracy of individual learners. All of this was tested via standard 10-fold cross-validation on training data provided by the N-GRID challenge organizers, from which the three best ensembles were selected for submission to N-GRID's blind testing. The best of our submitted solutions earned an overall final score of 0.863 according to the organizers' measure, and all three of our submissions placed within the top 10 of the 65 total submissions. The challenge constituted Track 2 of the 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDOC Individualized Domains (N-GRID) Shared Task in Clinical Natural Language Processing.
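A compact scikit-learn sketch of the shape of steps (iv) through (vi) follows; the synthetic data, the choice of two classifiers, and all parameters are illustrative stand-ins, not the thesis's 22 classifiers and seven data views:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the engineered text-analytics features
# (4 classes mirror the 4-point severity scale).
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipeline = Pipeline([
    # (iv) feature selection to eliminate irrelevant features
    ("select", SelectKBest(f_classif, k=20)),
    # (v)-(vi) train several classifiers and combine them in an ensemble
    ("ensemble", VotingClassifier(
        estimators=[("ada", AdaBoostClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(random_state=0))],
        voting="soft")),
])

scores = cross_val_score(pipeline, X, y, cv=10)  # standard 10-fold CV
print(f"mean CV accuracy: {scores.mean():.3f}")
```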
213
Modeli srpskog jezika i njihova primena u govornim i jezičkim tehnologijama (Models of the Serbian language and their application in speech and language technologies). Ostrogonac, Stevan, 21 December 2018.
<p>Statistički jezički model, u teoriji, predstavlja raspodelu verovatnoća nad skupom svih mogućih sekvenci reči nekog jezika. U praksi, to je mehanizam kojim se estimiraju verovatnoće sekvenci, koje su od interesa. Matematički aparat vezan za modele jezika je uglavnom nezavisan od jezika. Međutim, kvalitet obučenih modela ne zavisi samo od algoritama obuke, već prvenstveno od količine i kvaliteta podataka koji su na raspolaganju za obuku. Za jezike sa kompleksnom morfologijom, kao što je srpski, tekstualni korpus za obuku modela mora biti daleko obimniji od korpusa koji bi se koristio kod nekog od jezika sa relativno jednostavnom morfologijom, poput engleskog. Ovo istraživanje obuhvata razvoj jezičkih modela za srpski jezik, počevši od prikupljanja i inicijalne obrade tekstualnih sadržaja, preko adaptacije algoritama i razvoja metoda za rešavanje problema nedovoljne količine podataka za obuku, pa do prilagođavanja i primene modela u različitim tehnologijama, kao što su sinteza govora na osnovu teksta, automatsko prepoznavanje govora, automatska detekcija i korekcija gramatičkih i semantičkih grešaka u tekstovima, a postavljaju se i osnove za primenu jezičkih modela u automatskoj klasifikaciji dokumenata i drugim tehnologijama. Jezgro razvoja jezičkih modela za srpski predstavlja definisanje morfoloških klasa reči na osnovu informacija koje su sadržane u morfološkom rečniku, koji je nastao kao rezultat jednog od ranijih istraživanja.</p> / <p>A statistical language model, in theory, represents a probability distribution over sequences of words of a language. In practice, it is a tool for estimating probabilities of word sequences of interest. Mathematical basis related to language models is mostly language independent. However, the quality of trained models depends not only on training algorithms, but on the amount and quality of available training data as well. For languages with complex morphology, such as Serbian, textual corpora for training language models need to be significantly larger than the corpora needed for training language models for languages with relatively simple morphology, such as English. This research represents the entire process of developing language models for Serbian, starting with collecting and preprocessing of textual contents, extending to adaptation of algorithms and development of methods for addressing the problem of insufficient training data, and finally to adaptation and application of the models in different technologies, such as text-to-speech synthesis, automatic speech recognition, automatic detection and correction of grammar and semantic errors in texts, and determining basics for the application of the models in automatic document classification and other tasks. The core of the development of language models for Serbian is defining morphologic classes of words, based on the information contained within the morphologic dictionary of Serbian, which was one of the results of a previous research.</p>
214
Multi-Perspective Semantic Information Retrieval in the Biomedical Domain. January 2020.
Information Retrieval (IR) is the task of obtaining pieces of data (such as documents or snippets of text) that are relevant to a particular query or need from a large repository of information. IR is a valuable component of several downstream Natural Language Processing (NLP) tasks, such as Question Answering. Practically, IR is at the heart of many widely used technologies like search engines.
While probabilistic ranking functions have been used in IR systems since the 1970s, with the Okapi BM25 function as a prominent example, modern neural approaches offer certain advantages over their classical counterparts. In particular, the release of BERT (Bidirectional Encoder Representations from Transformers) has had a significant impact on the NLP community by demonstrating how a Masked Language Model (MLM) trained on a considerable corpus of data can improve a variety of downstream NLP tasks, including sentence classification and passage re-ranking.
IR systems are also important in the biomedical and clinical domains. Given the continuously increasing amount of scientific literature in the biomedical domain, the ability to find answers to specific clinical queries in a repository of millions of articles is of practical value to medics, doctors, and other medical professionals. Moreover, the biomedical domain presents its own challenges, including handling clinical jargon and evaluating the similarity or relatedness of various medical symptoms when determining the relevance between a query and a sentence.
This work presents contributions to several aspects of biomedical semantic information retrieval. First, it introduces Multi-Perspective Sentence Relevance, a novel methodology for utilizing BERT-based models in contextual IR; the system is evaluated using the BioASQ biomedical IR challenge. Finally, it provides practical contributions in the form of a live IR system for medics and a proposed challenge on the Living Systematic Review clinical task.
Dissertation/Thesis. Masters Thesis, Computer Science, 2020.
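For reference, a minimal sketch of the Okapi BM25 scoring mentioned above (using the common non-negative idf variant); the toy documents are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(t for d in tokenized for t in set(d))  # document frequencies
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = ["aspirin reduces fever in adults",
        "bert improves passage re-ranking for clinical queries"]
print(bm25_scores("clinical passage re-ranking", docs))  # second doc scores higher
```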
215
Interpretability for Deep Learning Text Classifiers. Lucaci, Diana, 14 December 2020.
The ubiquitous presence of automated decision-making systems whose performance is comparable to that of humans has brought attention to the necessity of interpretability for the generated predictions. Whether the goal is predicting the system’s behavior when the input changes, building user trust, or assisting experts in improving machine learning methods, interpretability is paramount when the problem is not sufficiently validated in real applications and when unacceptable results lead to significant consequences.

While for humans there are no standard interpretations for the decisions they make, the complexity of systems with advanced information-processing capacities conceals the detailed explanations for individual predictions, encapsulating them under layers of abstraction and complex mathematical operations. Interpretability for deep learning classifiers thus becomes a challenging research topic, where the ambiguity of the problem statement allows for multiple exploratory paths.

Our work focuses on generating natural language interpretations for individual predictions of deep learning text classifiers. We propose a framework for extracting and identifying the phrases of the training corpus that influence the prediction confidence the most, through unsupervised key phrase extraction and neural predictions. We assess the contribution margin of the added justification when the deep learning model predicts the class probability of a text instance, introducing a contribution metric that quantifies the fidelity of the explanation to the model. We assess both the performance impact of the proposed approach on the classification task, through quantitative analysis, and the quality of the generated justifications, through extensive qualitative and error analysis.

This methodology captures the most influential phrases of the training corpus as explanations that reveal the linguistic features used for individual test predictions, allowing humans to predict the behavior of the deep learning classifier.
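A minimal sketch in the spirit of such a contribution metric (the thesis's exact definition is not reproduced here): compare the model's class probability with and without a candidate phrase. The toy model and cue word are hypothetical:

```python
def contribution(predict_proba, text, phrase, target_class):
    """Drop in predicted probability when the phrase is removed."""
    full = predict_proba(text)[target_class]
    ablated = predict_proba(text.replace(phrase, ""))[target_class]
    return full - ablated

def toy_predict_proba(text):
    # Stand-in classifier: "positive" probability grows with occurrences
    # of one illustrative cue word.
    p = min(1.0, 0.2 + 0.4 * text.lower().count("excellent"))
    return {"positive": p, "negative": 1.0 - p}

print(contribution(toy_predict_proba, "an excellent, truly excellent film",
                   "excellent", "positive"))  # -> 0.8
```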
216
Preprocessing method comparison and model tuning for natural language data. Tempfli, Peter, January 2020.
Twitter and other microblogging services are a valuable source of almost real-time marketing, public-opinion, and brand-related consumer information. As such, the collection and analysis of user-generated natural language content is the focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training-parameter selection. Papers in recent years have examined the field thoroughly, and there is agreement that relatively simple techniques, such as a bag-of-words transformation of text and a naive Bayes model, can generate acceptable results (F1-scores between 75% and 85% for an average dataset), while fine-tuning can be difficult and yields relatively small gains. However, a difference of a few percent, even on a mid-sized dataset, can mean thousands of documents classified correctly or incorrectly, which in turn can mean thousands of missed sales or angry customers in any business domain. This work therefore presents and demonstrates a framework for better-tailored, fine-tuned models for analysing Twitter data. The experiments show that naive Bayes classifiers with domain-specific stopword selection work best (up to an 88% F1-score), although performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial for increasing prediction performance, and the experiments show that a stopword set should be domain-specific. The conclusion is that there is no single best way to train models and select stopwords in sentiment analysis. The work therefore suggests that there is room for a comparison framework to fine-tune prediction models for a given problem: such a framework should compare different training settings on the same dataset, so that the best-trained models can be found for a given real-life problem.
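A minimal scikit-learn sketch of the winning configuration described above, a naive Bayes classifier over bag-of-words counts with a domain-specific stopword list; the tweets, labels, and stopwords are toy stand-ins:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins: real experiments would use thousands of labeled tweets
# and a stopword list tuned to the domain.
domain_stopwords = ["rt", "amp", "http", "the", "a"]
tweets = ["great phone battery", "awful screen rt",
          "the battery is great", "awful support a"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(
    CountVectorizer(stop_words=domain_stopwords),  # bag-of-words features
    MultinomialNB(),
)
model.fit(tweets, labels)
print(model.predict(["the screen is awful"]))  # -> ['neg']
```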
217
Comparing PSO-Based Clustering Over Contextual Vector Embeddings to Modern Topic Modeling. Miles, Samuel, 05 1900.
Indiana University-Purdue University Indianapolis (IUPUI)
Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare
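A simplified PSO clustering sketch in the spirit of this approach (plain global-best PSO rather than the thesis's dimension-wise pPSO): each particle encodes k candidate centroids, and fitness is the total distance from each point to its nearest centroid; the toy embeddings are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_cluster(X, k=2, n_particles=10, iters=50, w=0.7, c1=1.5, c2=1.5):
    n, d = X.shape
    pos = rng.uniform(X.min(0), X.max(0), size=(n_particles, k, d))
    vel = np.zeros_like(pos)

    def fitness(p):  # lower is better: total distance to nearest centroid
        return np.linalg.norm(X[:, None, :] - p[None], axis=2).min(axis=1).sum()

    pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_fit.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmin()].copy()
    return gbest

# Toy "embeddings": two well-separated 5-dimensional clusters.
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(1.0, 0.1, (20, 5))])
print(pso_cluster(X))  # centroids should land near the two cluster centers
```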
218
Natural Language Processing of Stories. Rittichier, Kaley J., 05 1900.
Indiana University-Purdue University Indianapolis (IUPUI)
In this thesis, I deal with the task of computationally processing stories, with a focus on multidisciplinary ends, specifically in Digital Humanities and Cultural Analytics. In the process, I collect, clean, investigate, and predict from two datasets. The first is a dataset of 2,302 open-source literary works categorized by the time period in which they are set, all collected from Project Gutenberg; the time-period labels were derived by collecting and inspecting Library of Congress subject classifications, Wikipedia categories, and literary fact sheets from SparkNotes. The second is a dataset of 6,991 open-source literary works categorized by the hierarchical location in which each work is set; these labels were constructed from Library of Congress subject classifications and SparkNotes fact sheets. These datasets are the first of their kind and can help advance an understanding of 1) the presentation of settings in stories and 2) the effect settings have on our understanding of the stories.
219
Exploration of Visual, Acoustic, and Physiological Modalities to Complement Linguistic Representations for Sentiment Analysis. Pérez-Rosas, Verónica, 12 1900.
This research is concerned with the identification of sentiment in multimodal content. This is of particular interest given the increasing presence of subjective multimodal content on the web and other sources, which constitutes a rich and vast source of people's opinions, feelings, and experiences. Despite the need for tools that can identify opinions in the presence of diverse modalities, most current methods for sentiment analysis are designed for textual data only, and few attempts have been made to address this problem. This dissertation investigates techniques for augmenting linguistic representations with acoustic, visual, and physiological features. The potential benefits of using these modalities include linguistic disambiguation, visual grounding, and the integration of information about people's internal states. The main goal of this work is to build computational resources and tools that allow sentiment analysis to be applied to multimodal data. The thesis makes three important contributions. First, it shows that modalities such as audio, video, and physiological data can be successfully used to improve existing linguistic representations for sentiment analysis. We present a method that integrates linguistic features with features extracted from these modalities; the features are derived from verbal statements, audiovisual recordings, thermal recordings, and physiological sensor signals. The resulting multimodal sentiment analysis system significantly outperforms the use of language alone. Using this system, we were able to predict the sentiment expressed in video reviews as well as the sentiment experienced by viewers exposed to emotionally loaded content. Second, the thesis provides evidence of the portability of the developed strategies to other affect-recognition problems, supported by a study of the deception-detection problem. Third, the thesis contributes several multimodal datasets that will enable further research in sentiment and deception detection.
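One simple way to integrate features from several modalities is early fusion; the sketch below is a minimal illustration of that idea under stated assumptions (random stand-in features and labels), not the dissertation's actual method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-modality feature vectors are concatenated and fed to one classifier.
rng = np.random.default_rng(0)
n = 100
linguistic = rng.normal(size=(n, 20))     # e.g., lexical / lexicon-based scores
acoustic = rng.normal(size=(n, 10))       # e.g., pitch and energy statistics
physiological = rng.normal(size=(n, 5))   # e.g., thermal / sensor summaries
labels = rng.integers(0, 2, size=n)       # sentiment: 0 = negative, 1 = positive

X = np.hstack([linguistic, acoustic, physiological])  # early fusion
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))  # training accuracy on the toy data
```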
220
Extracting Temporally-Anchored Spatial Knowledge. Vempala, Alakananda, 05 1900.
In my dissertation, I elaborate on the work I have done to extract temporally-anchored spatial knowledge from text, including both intra- and inter-sentential knowledge. I also detail multiple approaches to inferring the spatial timeline of a person from biographies and social media. I present and analyze two strategies for annotating whether a given entity is or is not located at some location, and for how long, with respect to an event. Specifically, I leverage semantic roles or syntactic dependencies to generate potential spatial knowledge and then crowdsource annotations to validate it. The resulting annotations indicate how long entities are or are not located somewhere and temporally anchor this spatial information. I present an in-depth corpus analysis and experiments comparing the spatial knowledge generated by manipulating roles versus dependencies. I also explore research methodologies that go beyond single sentences and extract spatio-temporal information from text. Spatial timelines refer to a chronological ordering of the locations where a target person is or is not located. I present a corpus and experiments for extracting spatial timelines from Wikipedia biographies, and I describe my work on determining the locations a person visits, and the order in which they are visited, from their travel experiences. Specifically, I extract spatio-temporal graphs that capture the order (edges) of locations (nodes) visited by a person. Finally, I detail experiments that leverage both text and images to extract the spatial timeline of a person from Twitter.
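Such a spatio-temporal graph can be sketched with networkx; the visit list below is an illustrative stand-in for extracted data:

```python
import networkx as nx

# Nodes are locations; directed edges record the order in which they
# were visited, annotated with the (illustrative) departure time.
visits = [("Paris", "2015-03"), ("Berlin", "2015-06"), ("Rome", "2016-01")]

G = nx.DiGraph()
for (loc_a, time_a), (loc_b, _) in zip(visits, visits[1:]):
    G.add_edge(loc_a, loc_b, departed=time_a)

print(list(nx.topological_sort(G)))  # ['Paris', 'Berlin', 'Rome']
```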