231 |
A Mixed Methods Study of Ranger Attrition: Examining the Relationship of Candidate Attitudes, Attributions and Goals
Coombs, Aaron Keith 01 May 2023 (has links)
Elite military selection programs like the 75th Ranger Regiment's Ranger Assessment and Selection Program (RASP) are known for their difficulty and high attrition rates, despite substantial candidate screening just to get into such programs. The current study analyzes Ranger candidates' attitudes, attributions, and goals (AAGs) and their relationship with successful completion of pre-RASP, a preparation phase for the demanding eight-week RASP program. Candidates' entry and exit surveys were analyzed using natural language processing (NLP), as well as more traditional statistical analyses of Likert-measured survey items, to determine which reasons for joining and which individual goals were related to candidate success. Candidates' Intrinsic Motivations and Satisfaction as measured on entry surveys were the strongest predictors of success. Specifically, candidates' desire to deploy or serve in combat, and the goal of earning credibility in the Rangers, were the most important reasons and goals provided through candidates' open-text responses. Additionally, between-groups analyses comparing Black, Hispanic, and White candidates showed that differences in candidate abilities and motivations better explain pre-RASP attrition than demographic proxies such as race or ethnicity. The study's use of NLP demonstrates the practical utility of applying machine learning to quantitatively analyze open-text responses that have traditionally been limited to qualitative analysis or subject to human coding, although predictive models using more traditional Likert measurement of AAGs had better predictive accuracy. / Doctor of Philosophy / Elite military selection programs like the 75th Ranger Regiment's Ranger Assessment and Selection Program (RASP) are known for their difficulty and high attrition rates, despite substantial candidate screening just to get into such programs. The current study analyzes Ranger candidates' attitudes and goals and their relationship with successful completion of pre-RASP, a preparation phase for the demanding eight-week RASP program. Candidates' entry and exit surveys were analyzed to better understand the relationship between candidates' reasons for volunteering and their goals in the organization. Candidates' Intrinsic Motivations and their Satisfaction upon arrival for pre-RASP best predicted candidate success. Specifically, candidates' desires to deploy or serve in combat, and the goal of earning credibility in the Rangers, were the most important reasons and goals provided through candidates' open-text responses. Additionally, between-groups analyses comparing Black, Hispanic, and White candidates showed that differences in candidate abilities and motivations better explain pre-RASP attrition than demographic proxies such as race or ethnicity.
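For context, the sketch below illustrates the general kind of NLP pipeline the abstract describes: scoring open-text survey responses with bag-of-words features and relating them to a completion outcome. The responses, labels, and feature choices are invented for illustration and are not the study's actual data or models.

```python
# A minimal, illustrative sketch (not the study's actual pipeline): score free-text
# "reasons for joining" responses with TF-IDF features and a logistic regression
# that predicts program completion. All data here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical open-text responses and completion labels (1 = completed pre-RASP).
responses = [
    "I want to deploy and serve in combat with the best unit",
    "Looking for a bigger paycheck and bonuses",
    "Earn my tab and credibility in the Rangers",
    "My recruiter suggested it",
]
completed = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # unigram + bigram features
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, responses, completed, cv=2)  # tiny toy cross-validation
print("toy cross-validation accuracy:", scores.mean())
```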
|
232 |
Mutual Learning Algorithms in Machine Learning
Chowdhury, Sabrina Tarin 05 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / A mutual learning algorithm is a machine learning approach in which multiple learning agents train on different sources and then share their knowledge among themselves, so that all the agents can improve their classification and prediction accuracies simultaneously. Mutual learning can be an efficient mechanism for improving machine learning and neural network performance in a multi-agent system. Usually, in knowledge distillation algorithms, a big network plays the role of a static teacher and passes its knowledge to smaller networks, known as student networks, to improve the efficiency of the latter. In this thesis, it is shown that two small networks can dynamically and interchangeably play the roles of teacher and student to share their knowledge, and hence the efficiency of both networks improves simultaneously. This type of dynamic learning mechanism can be very useful in mobile environments, where resource constraints limit training with big datasets. Data exchange in a multi-agent, teacher-student network system can lead to efficient learning. The concept and the proposed mutual learning algorithm are demonstrated using convolutional neural networks (CNNs) and support vector machines (SVMs) on the pattern recognition problem posed by the MNIST handwriting dataset.
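A rough sketch of the dynamic teacher/student idea described above, in the spirit of deep mutual learning: two small networks train together, and each additionally matches the other's predictions through a KL-divergence term. The architectures (tiny MLPs standing in for the thesis's CNN and SVM agents), loss weighting, and data are assumptions for illustration, not the thesis's configuration.

```python
# Illustrative sketch of mutual learning between two small networks (assumed setup).
# Each network serves as "teacher" for the other by providing soft targets via KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_net():
    # Tiny MLP stand-in for a small image classifier, for brevity.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

net_a, net_b = make_net(), make_net()
opt_a = torch.optim.Adam(net_a.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(net_b.parameters(), lr=1e-3)

def mutual_step(x, y, alpha=0.5):
    """One training step in which each network learns from labels and from its peer."""
    logits_a, logits_b = net_a(x), net_b(x)
    ce_a = F.cross_entropy(logits_a, y)          # supervised losses
    ce_b = F.cross_entropy(logits_b, y)
    # Peer (mutual) losses: each network mimics the other's current predictions.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=1),
                    F.softmax(logits_a.detach(), dim=1), reduction="batchmean")
    loss_a, loss_b = ce_a + alpha * kl_a, ce_b + alpha * kl_b
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()

# Toy batch standing in for MNIST images (28x28) and labels.
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
print(mutual_step(x, y))
```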
Machine learning is also applied in the field of natural language processing (NLP). Machines with a basic understanding of human language are becoming increasingly common in day-to-day life; therefore, NLP-enabled machines with memory-efficient training can potentially become an indispensable part of our lives in the near future. A classic problem in NLP is news classification, where news articles are assigned to news categories by machine learning algorithms. In this thesis, we show news classification implemented using the Naïve Bayes and support vector machine (SVM) algorithms. We then show that two small networks can dynamically play the changing roles of teacher and student to share their knowledge on news classification, and hence the efficiency of both networks improves simultaneously. The mutual learning algorithm is applied between homogeneous agents first, i.e., between two Naïve Bayes agents and between two SVM agents. Mutual learning is then demonstrated between heterogeneous agents, i.e., between one Naïve Bayes agent and one SVM agent, and the relative efficiency increase of the agents before and after mutual learning is discussed. / 2025-04-04
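As a point of reference, here is a small sketch of the baseline news-classification setup named above, using Naïve Bayes and a linear SVM on TF-IDF features. The tiny corpus and category labels are invented; the thesis's dataset and the mutual-learning exchange between the agents are not reproduced here.

```python
# Simplified sketch of the news-classification baselines (Naive Bayes and SVM)
# on TF-IDF features. The corpus and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

articles = [
    "The central bank raised interest rates again this quarter",
    "The striker scored twice in the championship final",
    "New smartphone chips promise faster on-device AI",
    "Parliament passed the new budget after a long debate",
]
labels = ["business", "sports", "tech", "politics"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(articles)

nb = MultinomialNB().fit(X, labels)     # Naive Bayes classifier
svm = LinearSVC().fit(X, labels)        # linear SVM classifier

test = vectorizer.transform(["The goalkeeper saved a penalty in extra time"])
print("Naive Bayes:", nb.predict(test)[0], "| SVM:", svm.predict(test)[0])
```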
|
233 |
Evaluation of Automatic Text Summarization Using Synthetic Facts
Ahn, Jaewook 01 June 2022 (has links) (PDF)
Automatic text summarization has achieved remarkable success with the development of deep neural networks and the availability of standardized benchmark datasets, and it can generate fluent, human-like summaries. However, the unreliability of existing evaluation metrics hinders its practical usage and slows down its progress. To address this issue, we propose an automatic, reference-less text summarization evaluation system based on dynamically generated synthetic facts. We hypothesize that if a system can verify that a summary contains all the facts that are 100% known in the synthetic document, it can provide natural interpretability and high feasibility in measuring factual consistency and comprehensiveness. To our knowledge, ours is the first system that measures the overarching quality of text summarization models in terms of factual consistency, comprehensiveness, and compression rate. We validate our system by comparing its correlation with human judgment against that of existing N-gram overlap-based metrics such as ROUGE and BLEU and a BERT-based evaluation metric, BERTScore. In our experimental evaluation of PEGASUS, BART, and T5, our system outperforms the current evaluation metrics in measuring factual consistency by a noticeable margin and shows statistically significant gains in measuring comprehensiveness and overall summary quality.
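To make the idea concrete, here is a heavily simplified sketch of reference-less evaluation against a fully known set of synthetic facts: comprehensiveness as the fraction of planted facts recoverable from the summary, factual consistency as the fraction of summary statements supported by known facts, plus a compression rate. Exact string matching is a stand-in for the paper's actual fact-verification machinery.

```python
# Toy sketch of evaluating a summary against a synthetic document whose facts are
# 100% known in advance. String containment stands in for real fact matching.
from dataclasses import dataclass

@dataclass
class SyntheticDoc:
    text: str
    facts: list  # every fact planted in the document is known in advance

def evaluate(summary: str, doc: SyntheticDoc) -> dict:
    covered = [f for f in doc.facts if f.lower() in summary.lower()]
    comprehensiveness = len(covered) / len(doc.facts)
    # With fully known facts, any summary sentence not supported by them is suspect.
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    consistent = [s for s in sentences if any(f.lower() in s.lower() for f in doc.facts)]
    factual_consistency = len(consistent) / max(len(sentences), 1)
    compression_rate = len(summary.split()) / len(doc.text.split())
    return {"comprehensiveness": comprehensiveness,
            "factual_consistency": factual_consistency,
            "compression_rate": compression_rate}

doc = SyntheticDoc(
    text="Acme Corp was founded in 1990. Acme Corp is based in Oslo. Acme Corp makes robots.",
    facts=["founded in 1990", "based in Oslo", "makes robots"],
)
print(evaluate("Acme Corp was founded in 1990 and makes robots.", doc))
```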
|
234 |
Improving Vulnerability Description Using Natural Language Generation
Althebeiti, Hattan 01 January 2023 (links) (PDF)
Software plays an integral role in powering numerous everyday computing gadgets. As our reliance on software continues to grow, so does the prevalence of software vulnerabilities, with significant implications for organizations and users. As such, documenting vulnerabilities and tracking their development becomes crucial. Vulnerability databases address this issue by storing a record with various attributes for each discovered vulnerability. However, their contents suffer from several drawbacks, which we address in our work. In this dissertation, we investigate the weaknesses associated with vulnerability descriptions in public repositories and alleviate such weaknesses through Natural Language Processing (NLP) approaches. The first contribution examines vulnerability descriptions in those databases and approaches to improve them. We propose a new automated method that leverages external sources to enrich the scope and context of a vulnerability description. Moreover, we exploit fine-tuned pretrained language models for normalizing the resulting description. The second contribution investigates the need for a uniform and normalized structure in vulnerability descriptions. We address this need by breaking the description of a vulnerability into multiple constituents and developing a multi-task model that creates a new uniform and normalized summary, maintaining the necessary attributes of the vulnerability using the extracted features while ensuring a consistent vulnerability description. Our method proved effective in generating new summaries with the same structure across a collection of various vulnerability descriptions and types. Our final contribution investigates the feasibility of assigning the Common Weakness Enumeration (CWE) attribute to a vulnerability based on its description. CWE offers a comprehensive framework that categorizes similar exposures into classes, representing the types of exploitation associated with such vulnerabilities. Our approach, utilizing pre-trained language models, is shown to outperform a Large Language Model (LLM) on this task. Overall, this dissertation provides various technical approaches exploiting advances in NLP to improve publicly available vulnerability databases.
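A condensed sketch of the kind of CWE assignment the final contribution describes: fine-tuning a pre-trained encoder to classify a vulnerability description into a CWE class. The model name, label set, example descriptions, and training loop are illustrative assumptions rather than the dissertation's exact configuration.

```python
# Illustrative sketch (assumed setup): fine-tune a pre-trained encoder to map a
# vulnerability description to a CWE class. Labels and hyperparameters are made up.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

cwe_labels = ["CWE-79 (XSS)", "CWE-89 (SQL injection)", "CWE-125 (Out-of-bounds read)"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(cwe_labels))

descriptions = [
    "Improper neutralization of user input allows script injection in the web UI.",
    "Unsanitized parameters in the login form permit arbitrary SQL queries.",
    "A crafted packet causes a read past the end of the allocated buffer.",
]
labels = torch.tensor([0, 1, 2])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # tiny demo loop; real fine-tuning uses many batches and epochs
    enc = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    test = tokenizer(["User-controlled HTML is echoed back without escaping."],
                     return_tensors="pt")
    pred = model(**test).logits.argmax(dim=-1).item()
print("predicted class:", cwe_labels[pred])
```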
|
235 |
Disambiguating natural language via aligning meaningful descriptions
Xin, Yida 07 February 2024 (has links)
Artificial Intelligence (AI) technologies are increasingly pervading aspects of our lives. Because people use natural language to communicate with each other, computers should also use natural language to communicate with us. One of the principal obstacles to achieving this is the ambiguity of natural language, evidenced in problems such as prepositional phrase attachment and pronoun coreference. Current methods rely on the statistical frequency of word patterns, but this is often brittle and opaque to people.
In this thesis, I explore the idea of using commonsense knowledge to resolve linguistic ambiguities. I introduce PatchComm, which invokes explicit commonsense assertions to solve context-independent ambiguities. When commonsense assertions are missing, I invoke RetroGAN-DRD, which leverages state-of-the-art inference techniques such as retrofitting and generative adversarial networks (GANs) to infer commonsense assertions. I build upon that with ProGeneXP, which brings state-of-the-art language models to the task of describing its inputs and implicit knowledge in natural language, providing meaningful descriptions for PatchComm to align with in order to further resolve linguistic ambiguities. Finally, I introduce DialComm to lay the groundwork for moving from single-sentence disambiguation to discourse. Specifically, DialComm builds upon PatchComm to obtain information from single sentences and integrates that information with additional commonsense assertions to build integral frame representations for discourses. I illustrate DialComm's ability with an application to end-user programming in natural language.
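To ground the idea, the toy sketch below shows how explicit commonsense assertions might be aligned against the alternatives of a classic prepositional-phrase attachment ambiguity. The assertion store, relations, and scoring function are hypothetical stand-ins invented for illustration, not PatchComm's actual interface or knowledge base.

```python
# Hypothetical sketch of commonsense-guided PP-attachment disambiguation for
# "I ate the pasta with a fork" vs. "I ate the pasta with sauce". The assertion
# store and scoring are invented stand-ins for a real commonsense knowledge base.
COMMONSENSE = {
    ("fork", "UsedFor", "eat"): 0.9,     # instruments tend to attach to the verb
    ("sauce", "PartOf", "pasta"): 0.8,   # ingredients tend to attach to the noun
}

def assertion_score(head: str, relation: str, dependent: str) -> float:
    return COMMONSENSE.get((dependent, relation, head), 0.0)

def attach_pp(verb: str, obj: str, pp_noun: str) -> str:
    """Pick the attachment whose supporting commonsense assertion scores higher."""
    verb_attach = assertion_score(verb, "UsedFor", pp_noun)   # e.g. fork UsedFor eat
    noun_attach = assertion_score(obj, "PartOf", pp_noun)     # e.g. sauce PartOf pasta
    return "verb" if verb_attach >= noun_attach else "noun"

print(attach_pp("eat", "pasta", "fork"))   # -> verb  (ate ... with a fork)
print(attach_pp("eat", "pasta", "sauce"))  # -> noun  (pasta with sauce)
```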
The contributions of this dissertation lie in showing how commonsense inference can be integrated with parsing to resolve ambiguities in natural language in a transparent manner. I have implemented three candidate systems with increasingly sophisticated approaches. I verified that they perform well on some standard tests and that they operate in a way that is understandable to people. This counters the supposed inevitability of an interpretability-performance tradeoff. I have shown how my techniques can be used in a candidate application, programming in natural language.
My work leaves us in a good position to exploit further advances in natural language understanding and commonsense inference. I am optimistic that natural, transparent communication with computers will help make the world a better place.
|
236 |
Incorporating semantic and syntactic information into document representation for document clustering
Wang, Yong 06 August 2005 (has links)
Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, documents are represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic and syntactic information. Semantic analysis and syntactic analysis are performed on the raw text to identify this information. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic and syntactic information can improve the performance of our document clustering system for most of our data sets, and a statistically significant improvement can be achieved when both are combined. Our experimental results with compound words show that using only compound words does not improve clustering performance on our data sets. When compound words are combined with the original single words, the combined feature set achieves slightly better performance for most data sets, but this improvement is not statistically significant. In order to select the best clustering algorithm for our document clustering system, a comparison of several widely used clustering algorithms is performed. Although the bisecting K-means method has advantages when working with large datasets, a traditional hierarchical clustering algorithm still achieves the best performance for our small datasets.
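For illustration, a small sketch comparing the two clustering families mentioned above, bisecting K-means and hierarchical (agglomerative) clustering, on plain bag-of-words document vectors. The toy corpus is invented, and this sketch omits the semantic and syntactic enrichment the project actually studies.

```python
# Toy sketch comparing bisecting K-means with hierarchical (agglomerative)
# clustering on TF-IDF document vectors. The corpus is invented; the project's
# semantic and syntactic enrichment of the representation is not reproduced.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, BisectingKMeans  # BisectingKMeans needs scikit-learn >= 1.1

docs = [
    "the stock market rallied after strong earnings reports",
    "investors worry about inflation and interest rates",
    "the team won the cup after a dramatic penalty shootout",
    "the striker signed a new contract with the club",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

bisecting = BisectingKMeans(n_clusters=2, random_state=0).fit_predict(X)
hierarchical = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("bisecting k-means :", bisecting)
print("hierarchical      :", hierarchical)
```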
|
237 |
Introducing Semantic Role Labels and Enhancing Dependency Parsing to Compute Politeness in Natural Language
Dua, Smrite 13 August 2015 (has links)
No description available.
|
238 |
Improving NLP Systems Using Unconventional, Freely-Available Data
Huang, Fei January 2013 (has links)
Sentence labeling is a type of pattern recognition task that involves assigning a categorical label to each word in an observed sentence. Standard supervised sentence-labeling systems often have poor generalization: it is difficult to estimate parameters for words which appear in the test set but seldom (or never) appear in the training set, because such systems use only words as features in their prediction tasks. Representation learning is a promising technique for discovering features that allow a supervised classifier to generalize from a source domain dataset to arbitrary new domains. We demonstrate that features learned from distributional representations of unlabeled data can be used to improve performance on out-of-vocabulary words and help the model generalize. We also argue that it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. We investigate techniques for building open-domain sentence labeling systems that approach the ideal of a system whose accuracy is high and consistent across domains. In particular, we investigate unsupervised techniques for language model representation learning that provide new features which are stable across domains, in that they are predictive in both the training and out-of-domain test data. In experiments, our best system with the proposed techniques reduces error by as much as 11.4% relative to the previous system using traditional representations on the Part-of-Speech tagging task. Moreover, we leverage the Posterior Regularization framework and develop an architecture for incorporating biases from prior knowledge into representation learning. We investigate three types of biases: entropy bias, distance bias, and predictive bias. Experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, resulting in a relative reduction in error of more than 16% for both tasks with respect to existing state-of-the-art representation learning techniques. We also extend the idea of using additional unlabeled data to improve the system's performance to a different NLP task, word alignment. Traditional word alignment takes only a sentence-level aligned parallel corpus as input and generates word-level alignments. However, with the increasing integration of different cultures, more and more people are competent in multiple languages and often use elements of multiple languages in conversation. Linguistic Code Switching (LCS) is such a situation, where two or more languages show up in the context of a single conversation. Traditional machine translation (MT) systems treat LCS data as noise, or just as regular sentences. However, if LCS data is processed intelligently, it can provide a useful signal for training word alignment and MT models. In this work, we first extract constraints from this code-switching data and then incorporate them into a word alignment model training procedure. We also show that by using the code-switching data, we can jointly train a word alignment model and a language model using co-training. Our techniques for incorporating LCS data improve BLEU score by 2.64 over a baseline MT system trained using only standard sentence-aligned corpora. / Computer and Information Science
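As a rough illustration of the representation-learning idea, the sketch below derives word representations from unlabeled text, clusters them, and uses the cluster identity as a feature alongside the word itself, so that an out-of-vocabulary word can still fire a useful shared feature. Word2vec vectors clustered with k-means stand in for the thesis's language-model representations; the corpus, settings, and tagger are simplified assumptions.

```python
# Simplified sketch: learn distributional word representations from unlabeled text,
# cluster them, and add cluster IDs as tagger features so that out-of-vocabulary
# words still share features with training words. Not the thesis's exact method.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

unlabeled = [
    "the doctor examined the patient".split(),
    "the nurse examined the chart".split(),
    "a surgeon examined an x-ray".split(),
]
w2v = Word2Vec(unlabeled, vector_size=25, min_count=1, seed=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    [w2v.wv[w] for w in w2v.wv.index_to_key])

def cluster_of(word):
    # Unknown words fall back to a dedicated "UNK" cluster feature.
    if word in w2v.wv:
        return int(kmeans.predict([w2v.wv[word]])[0])
    return -1

def features(word):
    return {"word=" + word: 1, "cluster=%d" % cluster_of(word): 1}

# Tiny labeled set (noun vs. determiner), just to exercise the feature pipeline.
train_words  = ["the", "doctor", "a", "nurse"]
train_labels = ["DET", "NOUN", "DET", "NOUN"]
tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit([features(w) for w in train_words], train_labels)

# "surgeon" never appears in the labeled data; its prediction leans on the cluster
# feature (on real corpora, distributionally similar words tend to share clusters).
print(tagger.predict([features("surgeon")]))
```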
|
239 |
Knowledge intensive natural language generation with revision
Cline, Ben E. 09 September 2008 (has links)
Traditional natural language generation systems use a pipelined architecture. Two problems with this architecture are poor task decomposition and the lack of interaction between conceptual and stylistic decision making. A revision architecture operating in a knowledge intensive environment is proposed as a means to deal with these two problems. In a revision system, text is produced and refined iteratively. A text production cycle consists of two steps. First, the text generators produce initial text. Second, this text is examined for defects by revisors. When defects are found, the revisors make suggestions for the regeneration of the text. The text generator/revision cycle continues to polish the text iteratively until no more defects can be found. Although previous research has focused on stylistic revisions only, this paper describes techniques for both stylistic and conceptual revisions.
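A schematic sketch of the generate-and-revise cycle described above: generators produce a draft, revisors look for defects and suggest regeneration, and the cycle repeats until no defects remain. The generator, revisor, and defect type here are invented placeholders, not Kalos's actual components.

```python
# Schematic sketch of a revision-based generation loop (hypothetical components,
# not Kalos itself): generate a draft, let revisors flag defects and suggest fixes,
# then regenerate until no revisor finds a defect.
def generate(knowledge, suggestions):
    # Stand-in generator: start verbose, then apply any revisor suggestions.
    text = f"The {knowledge['device']} device, which is a device, runs at {knowledge['clock']}."
    for fix in suggestions:
        text = fix(text)
    return text

def redundancy_revisor(text):
    """Stylistic revisor: flag the repeated word 'device' and suggest a rewrite."""
    if text.count("device") > 1:
        return lambda t: t.replace(", which is a device,", "")
    return None

def revise_until_clean(knowledge, revisors, max_drafts=5):
    suggestions = []
    for draft_no in range(1, max_drafts + 1):
        text = generate(knowledge, suggestions)
        new = [s for s in (r(text) for r in revisors) if s is not None]
        if not new:                       # no defects found: the draft is final
            return draft_no, text
        suggestions.extend(new)           # feed suggestions into the next draft
    return max_drafts, text

print(revise_until_clean({"device": "MC68000", "clock": "8 MHz"}, [redundancy_revisor]))
```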
Using revision to produce extended natural language text through a series of drafts provides three significant advantages over a traditional natural language generation system. First, it reduces complexity through task decomposition. Second, it promotes text polishing techniques that benefit from the ability to examine generated text in the context of the underlying knowledge from which it was generated. Third, it provides a mechanism for the integrated handling of conceptual and stylistic decisions.
For revision to operate intelligently and efficiently, the revision component must have access to both the surface text and the underlying knowledge from which it was generated. A knowledge intensive architecture with a uniform knowledge base allows the revision software to quickly locate referents, choices made in producing the defective text, alternatives to the decisions made at both the conceptual and stylistic levels, and the intent of the text. The revisors use this knowledge, along with facts about the topic at hand and knowledge about how text is produced, to select alternatives for improving the text.
The Kalos system was implemented to illustrate revision processing in a natural language generation system. It produces advanced draft quality text for a microprocessor users' guide from a knowledge base describing the microprocessor. It uses revision techniques in a knowledge intensive environment to iteratively polish its initial generation. The system performs both conceptual and stylistic revisions. Example output from the system, showing both types of revision, is presented and discussed. Techniques for dealing with the computational problems caused by the system's uniform knowledge base are described. / Ph. D.
|
240 |
Summarizing Legal Depositions
Chakravarty, Saurabh 18 January 2021 (has links)
Documents like legal depositions are used by lawyers and paralegals to ascertain the facts pertaining to a case. These documents capture the conversation between a lawyer and a deponent, which is in the form of questions and answers. Applying current automatic summarization methods to these documents results in low-quality summaries. Though extensive research has been performed in the area of summarization, not all methods succeed in all domains. Accordingly, this research focuses on developing methods to generate high-quality summaries of depositions. As part of our work related to legal deposition summarization, we propose a solution in the form of a pipeline of components, each addressing a sub-problem; we argue that a pipeline-based framework can be tuned to summarize documents from any domain.
First, we developed methods to parse the depositions, accounting for different document formats; we were able to successfully parse both a proprietary and a public dataset with our methods. Second, we developed methods to anonymize the personal information present in the deposition documents, achieving 95% accuracy on the anonymization using a random-sampling-based evaluation. Third, we developed an ontology to define dialog acts for the questions and answers present in legal depositions. Fourth, we developed classifiers based on this ontology and achieved F1-scores of 0.84 and 0.87 on the public and proprietary datasets, respectively. Fifth, we developed methods to transform a question-answer pair into a canonical/simple form; based on the dialog acts for the question and answer combination, we developed transformation methods using both traditional NLP and deep learning techniques, and achieved good scores on the ROUGE and semantic similarity metrics for most of the dialog act combinations. Sixth, we developed methods based on deep learning, heuristics, and machine translation to correct the transformed declarative sentences; the sentence correction improved the readability of the transformed sentences. Seventh, we developed a methodology to break a deposition into its topical aspects; an ontology for aspects was defined for legal depositions, and classifiers were developed that achieved an F1-score of 0.89. Eighth, we developed methods to segment the deposition into parts that share the same thematic context; the segments helped in augmenting candidate summary sentences with surrounding context, which leads to a more readable summary. Ninth, we developed a pipeline to integrate all of these methods and generate summaries from the depositions. We were able to outperform the baseline and state-of-the-art summarization methods in a majority of the cases based on the F1, Recall, and ROUGE-2 scores, and the performance gains were statistically significant for all of the scores. The summaries generated by our system can be arranged based on the same thematic context or aspect and hence should be much easier to read and follow, compared to the baseline methods.
As part of our future work, we will improve upon these methods. We will refine our methods to identify the important parts using additional documents related to a deposition. In addition, we will work to improve the compression ratio of the generated summaries by reducing the number of unimportant sentences. We will expand the training dataset to learn and tune the coverage of the aspects for various deponent types using empirical methods. Our system has demonstrated effectiveness in transforming a QA pair into a declarative sentence; having such a capability could enable us to generate a narrative summary from the depositions, a first for legal depositions. We will also expand our dataset for evaluation to ensure that our methods are indeed generalizable, and that they work well when experts subjectively evaluate the quality of the deposition summaries. / Doctor of Philosophy /
Documents in the legal domain are of various types. One set of documents includes trial and deposition transcripts. These documents capture the proceedings of a trial or a deposition by note-taking, often over many hours, and contain the conversational sentences spoken during the trial or deposition, involving multiple actors. One of the greatest challenges with these documents is that they are generally long, which is a source of pain for attorneys and paralegals who work with the information contained in them.
Text summarization techniques have been successfully used to compress a document and capture its salient parts, while reducing redundancy in summary sentences and focusing on coherence and proper sentence formation. Summarizing trial and deposition transcripts would be immensely useful for law professionals, reducing the time needed to identify and disseminate salient information in case-related documents, as well as reducing costs and trial preparation time. Processing deposition documents using traditional text processing techniques is a challenge because of their form. Having the deposition conversations transformed into a suitable declarative form, in which they can be easily comprehended, can pave the way for the use of extractive and abstractive summarization methods.
As part of our work, we identified the different discourse structures present in a deposition in the form of dialog acts, and developed methods based on those dialog acts to transform the deposition into a declarative form. We achieved an accuracy of 87% on the dialog act classification and were able to transform the conversational question-answer (QA) pairs into declarative forms for 10 of the top-11 dialog act combinations; our transformation methods performed better than the baselines for 8 out of the 10 QA pair types. We also developed methods to classify the deposition QA pairs according to their topical aspects, and generated summaries using aspects by defining the relative coverage of each aspect that should be present in a summary. Another set of methods can segment the depositions into parts that share the same thematic context. These segments aid in augmenting the candidate summary sentences, creating a summary where information is surrounded by associated context. This makes the summary more readable and informative; based on our evaluations, we were able to significantly outperform the state-of-the-art methods.
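To make the QA-to-declarative transformation step concrete, here is a toy, rule-based sketch that turns a deposition question-answer pair into a declarative sentence keyed on a (question act, answer act) combination. The dialog-act labels and rules are simplified stand-ins invented for illustration, not the ontology or trained models described above.

```python
# Toy rule-based sketch (not the dissertation's models): transform a deposition
# question-answer pair into a declarative sentence based on its dialog acts.
import re

def classify_dialog_acts(question, answer):
    """Very rough stand-in for the trained dialog-act classifiers."""
    q_act = "yes_no" if re.match(r"(?i)\s*(did|do|were|was|is|are|have|has)\b", question) else "wh"
    a_act = "confirm" if answer.strip().lower().startswith(("yes", "correct")) else "inform"
    return q_act, a_act

def to_declarative(question, answer, deponent="The deponent"):
    q_act, a_act = classify_dialog_acts(question, answer)
    if q_act == "yes_no" and a_act == "confirm":
        # "Did you sign the contract?" + "Yes." -> "The deponent sign the contract."
        # (The verb is left uninflected by this toy rule; a later sentence-correction
        # step, as in the pipeline above, would repair it to "signed".)
        body = re.sub(r"(?i)^\s*(did|do|were|was|is|are|have|has)\s+you\s+", "", question)
        return f"{deponent} {body.rstrip('?').strip()}."
    if q_act == "wh":
        # Wh-questions keep both sides so no information is lost.
        return f"{deponent}, asked '{question.strip()}', stated: {answer.strip()}"
    return f"{deponent} responded: {answer.strip()}"

print(to_declarative("Did you sign the contract in March?", "Yes, I did."))
print(to_declarative("Where were you on the night of May 4th?", "At the office."))
```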
|