241

Summarizing Legal Depositions

Chakravarty, Saurabh 18 January 2021 (has links)
Documents like legal depositions are used by lawyers and paralegals to ascertain the facts pertaining to a case. These documents capture the conversation between a lawyer and a deponent, which is in the form of questions and answers. Applying current automatic summarization methods to these documents results in low-quality summaries. Though extensive research has been performed in the area of summarization, not all methods succeed in all domains. Accordingly, this research focuses on developing methods to generate high-quality summaries of depositions. As part of our work related to legal deposition summarization, we propose a solution in the form of a pipeline of components, each addressing a sub-problem; we argue that a pipeline-based framework can be tuned to summarize documents from any domain. First, we developed methods to parse the depositions, accounting for different document formats. We were able to successfully parse both a proprietary and a public dataset with our methods. We next developed methods to anonymize the personal information present in the deposition documents; we achieve 95% accuracy on the anonymization using a random-sampling-based evaluation. Third, we developed an ontology to define dialog acts for the questions and answers present in legal depositions. Fourth, we developed classifiers based on this ontology and achieved F1-scores of 0.84 and 0.87 on the public and proprietary datasets, respectively. Fifth, we developed methods to transform a question-answer pair to a canonical/simple form. In particular, based on the dialog acts for the question and answer combination, we developed transformation methods using both traditional NLP and deep learning techniques. We were able to achieve good scores on the ROUGE and semantic similarity metrics for most of the dialog act combinations. Sixth, we developed methods based on deep learning, heuristics, and machine translation to correct the transformed declarative sentences. The sentence correction improved the readability of the transformed sentences. Seventh, we developed a methodology to break a deposition into its topical aspects. An ontology for aspects was defined for legal depositions, and classifiers were developed that achieved an F1-score of 0.89. Eighth, we developed methods to segment the deposition into parts that have the same thematic context. The segments helped in augmenting candidate summary sentences with surrounding context, which leads to a more readable summary. Ninth, we developed a pipeline to integrate all of the methods, to generate summaries from the depositions. We were able to outperform the baseline and state-of-the-art summarization methods in a majority of the cases based on the F1, Recall, and ROUGE-2 scores. The performance gains were statistically significant for all of the scores. The summaries generated by our system can be arranged based on the same thematic context or aspect and hence should be much easier to read and follow, compared to the baseline methods. As part of our future work, we will improve upon these methods. We will refine our methods to identify the important parts using additional documents related to a deposition. In addition, we will work to improve the compression ratio of the generated summaries by reducing the number of unimportant sentences. We will expand the training dataset to learn and tune the coverage of the aspects for various deponent types using empirical methods.
Our system has demonstrated effectiveness in transforming a QA pair into a declarative sentence. Having such a capability could enable us to generate a narrative summary from the depositions, a first for legal depositions. We will also expand our dataset for evaluation to ensure that our methods are indeed generalizable, and that they work well when experts subjectively evaluate the quality of the deposition summaries. / Doctor of Philosophy / Documents in the legal domain are of various types. One set of documents includes trial and deposition transcripts. These documents capture the proceedings of a trial or a deposition by note-taking, often over many hours. They contain conversation sentences that are spoken during the trial or deposition and involve multiple actors. One of the greatest challenges with these documents is that they are generally long. This is a source of pain for attorneys and paralegals who work with the information contained in the documents. Text summarization techniques have been successfully used to compress a document and capture the salient parts from it. They have also been able to reduce redundancy in summary sentences while focusing on coherence and proper sentence formation. Summarizing trial and deposition transcripts would be immensely useful for law professionals, reducing the time to identify and disseminate salient information in case-related documents, as well as reducing costs and trial preparation time. Processing the deposition documents using traditional text processing techniques is a challenge because of their form. Having the deposition conversations transformed into a suitable declarative form where they can be easily comprehended can pave the way for the usage of extractive and abstractive summarization methods. As part of our work, we identified the different discourse structures present in the deposition in the form of dialog acts. We developed methods based on those dialog acts to transform the deposition into a declarative form. We were able to achieve an accuracy of 87% on the dialog act classification. We also were able to transform the conversational question-answer (QA) pairs into declarative forms for 10 of the top-11 dialog act combinations. Our transformation methods performed better on 8 out of the 10 QA pair types when compared to the baselines. We also developed methods to classify the deposition QA pairs according to their topical aspects. We generated summaries using aspects by defining the relative coverage for each aspect that should be present in a summary. Another set of methods developed can segment the depositions into parts that have the same thematic context. These segments aid in augmenting the candidate summary sentences, creating a summary where information is surrounded by associated context. This makes the summary more readable and informative; we were able to significantly outperform the state-of-the-art methods, based on our evaluations.
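To make the transformation step concrete, the following is a minimal, hypothetical sketch of how a dialog-act-driven rewrite of a question-answer pair into a declarative sentence might look. The dialog act labels and rewrite rules are invented for illustration; they are not the ontology or the NLP/deep learning transformation methods developed in this work.

```python
# A minimal, hypothetical sketch of dialog-act-driven QA-to-declarative transformation.
# The dialog act labels and rewrite rules below are illustrative assumptions,
# not the taxonomy or transformation methods developed in the thesis.
import re

def qa_to_declarative(question: str, answer: str, q_act: str, a_act: str) -> str:
    q = question.strip().rstrip("?")
    a = answer.strip().rstrip(".")
    if q_act == "yes_no" and a_act == "confirm":
        # Strip the auxiliary ("Did you ...") and report the content as affirmed.
        body = re.sub(r"^(did|do|does|have|has|were|was|are|is)\s+you\s+", "", q, flags=re.I)
        return f"The deponent confirmed that they {body}."  # verb tense is not adjusted here
    if q_act == "wh" and a_act == "statement":
        return f"In response to '{q}?', the deponent stated: {a}."
    # Fallback: keep the pair as reported speech.
    return f"Asked '{q}?', the deponent answered '{a}'."

print(qa_to_declarative("Did you sign the contract", "Yes, I did", "yes_no", "confirm"))
# -> The deponent confirmed that they sign the contract.
print(qa_to_declarative("Where were you on May 3", "At the office", "wh", "statement"))
# -> In response to 'Where were you on May 3?', the deponent stated: At the office.
```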
242

Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration

Sullivan, Daniel Edward 07 June 2016 (has links)
This research addresses the question: can unsupervised learning generate a representation that improves on the commonly used term frequency-inverse document frequency (TF-IDF) representation by capturing semantic relations? The analysis measures the quality of sentence classification using TF-IDF representations, and finds a practical upper limit to precision and recall in a biomedical text classification task (F1-score of 0.85). Arguably, one could use ontologies to supplement TF-IDF, but ontologies are sparse in coverage and costly to create. This prompts a correlated question: can unsupervised learning capture semantic relations at least as well as existing ontologies, and thus supplement existing sparse ontologies? A shallow neural network implementing the Skip-Gram algorithm is used to generate semantic vectors using a corpus of approximately 2.4 billion words. The ability to capture meaning is assessed by comparing the generated semantic vectors with MESH. Results indicate that semantic vectors trained by unsupervised methods capture comparable levels of semantic features in some cases, such as amino acid (92% of similarity represented in MESH), but perform substantially worse on more expansive topics, such as pathogenic bacteria (37.8% similarity represented in MESH). Possible explanations for this difference in performance are proposed, along with a method to combine manually curated ontologies with semantic vector spaces to produce a more comprehensive representation than either alone. Semantic vectors are also used as representations for paragraphs, which, when used for classification, achieve an F1-score of 0.92. The results of classification and analogical reasoning tasks are promising, but a formal model of semantic vectors, subject to the constraints of known linguistic phenomena, is needed. This research includes initial steps for developing a formal model of semantic vectors based on a combination of linear algebra and fuzzy set theory subject to the semantic molecularism linguistic model. This research is novel in its analysis of semantic vectors applied to the biomedical domain, its analysis of different performance characteristics in biomedical analogical reasoning tasks, its comparison of the semantic relations captured by vectors and by MESH, and its initial development of a formal model of semantic vectors. / Ph. D.
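As a rough illustration of the Skip-Gram approach described above (not the 2.4-billion-word setup used in this research), the sketch below trains word vectors on a tiny invented corpus with gensim and compares term similarities; in the actual study, such similarities were compared against relations encoded in MESH.

```python
# Illustrative sketch only: train Skip-Gram vectors on a toy corpus and compare terms.
from gensim.models import Word2Vec

# Tiny invented corpus; the thesis used roughly 2.4 billion words of biomedical text.
corpus = [
    ["alanine", "is", "an", "amino", "acid"],
    ["glycine", "is", "an", "amino", "acid"],
    ["salmonella", "is", "a", "pathogenic", "bacterium"],
    ["listeria", "is", "a", "pathogenic", "bacterium"],
]

# sg=1 selects the Skip-Gram training algorithm.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

# Cosine similarity between learned term vectors (values on a toy corpus are noisy).
print(model.wv.similarity("alanine", "glycine"))
print(model.wv.similarity("alanine", "salmonella"))
```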
243

Arabic News Text Classification and Summarization: A Case of the Electronic Library Institute SeerQ (ELISQ)

Kan'an, Tarek Ghaze 21 July 2015 (has links)
Arabic news articles in heterogeneous electronic collections are difficult for users to work with. Two problems are: that they are not categorized in a way that would aid browsing, and that there are no summaries or detailed metadata records that could be easier to work with than full articles. To address the first problem, schema mapping techniques were adapted to construct a simple taxonomy for Arabic news stories that is compatible with the subject codes of the International Press Telecommunications Council. So that each article would be labeled with the proper taxonomy category, automatic classification methods were researched to identify the most appropriate. Experiments showed that the best features to use in classification resulted from a new tailored stemming approach (i.e., a new Arabic light stemmer called P-Stemmer). When coupled with binary classification using SVM, the newly developed approach proved to be superior to state-of-the-art techniques. To address the second problem, i.e., summarization, preliminary work was done with English corpora. This was in the context of a new Problem-Based Learning (PBL) course wherein students produced template summaries of big text collections. The techniques used in the course were extended to work with Arabic news. Due to the lack of high-quality tools for Named Entity Recognition (NER) and topic identification for Arabic, two new tools were constructed: RenA, for Arabic NER, and ALDA, an Arabic topic extraction tool (using Latent Dirichlet Allocation). Controlled experiments with each of RenA and ALDA, involving Arabic speakers and a randomly selected corpus of 1000 Qatari news articles, showed that the tools produced very good results (i.e., names, organizations, locations, and topics). Then the categorization, NER, topic identification, and additional information extraction techniques were combined to produce approximately 120,000 summaries for Qatari news articles, which are searchable, along with the articles, using LucidWorks Fusion, which builds upon Solr software. Evaluation of the summaries showed high ratings based on the 1000-article test corpus. Contributions of this research with Arabic news articles thus include a new test corpus, taxonomy, light stemmer, classification approach, NER tool, topic identification tool, and template-based summarizer, all shown through experimentation to be highly effective. / Ph. D.
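The classification step described above (binary SVM classification over features from a light stemmer) can be illustrated with the hedged sketch below. It uses scikit-learn's TF-IDF features and a one-vs-rest linear SVM on a few invented English example documents; it does not include the P-Stemmer or the Arabic corpus used in this work.

```python
# Illustrative sketch: one-vs-rest (binary) SVM classification over TF-IDF features,
# standing in for the thesis's pipeline of P-Stemmer features plus binary SVM.
# The tiny example documents and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["match ended with a late goal", "parliament passed the budget",
        "striker scored twice", "minister announced new policy"]
labels = ["sport", "politics", "sport", "politics"]  # stand-ins for IPTC-style subject codes

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(docs, labels)
print(clf.predict(["a late goal by the striker"]))  # expected: ['sport'] on this toy data
```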
244

Hyperpartisanship in Web Searched Articles

Sen, Anamika Ashit 21 August 2019 (has links)
News consumption is primarily done through online news media outlets and social media. There has been a recent rise in both fake news generation and consumption. Fake news refers to articles that deliberately contain false information to influence readers. Substantial dissemination of misinformation has been recognized to influence election results. This work focuses on hyperpartisanship in web-searched articles, that is, articles that take polarized views and represent a sensationalized view of the content. Many news websites cater to propagating such biased news for political and/or financial gain. This work uses Natural Language Processing (NLP) techniques on news articles to determine whether a web-searched article can be termed hyperpartisan or not. The methods were developed using a labeled dataset that was released as part of SemEval Task 4 - Hyperpartisan News Detection. The model was applied to queries related to the 2018 U.S. midterm elections. We found that more than half the articles in web search queries showed hyperpartisanship attributes. / Master of Science / Over recent years, the World Wide Web (WWW) has become a very important part of society. It has grown into a powerful medium not only to communicate with known contacts but also to gather, understand, and propagate ideas with the whole world. However, in recent times there has been an increasing generation and consumption of misinformation and disinformation. These types of news, particularly fake and hyperpartisan news, are curated so as to hide the actual facts and to present a biased, made-up view of the issue at hand. This activity can be harmful to society: the greater the spread and/or consumption of such news, the more negative the decisions made by the readers. Thus, it poses a threat to society, since it affects the actions of the people exposed to the news. In this work, we look into a similar genre of misinformation, that is, hyperpartisan news. Hyperpartisan news follows a hyperpartisan orientation - the news exhibits biased opinions toward an entity (party, people, etc.). In this work, we explore how Natural Language Processing (NLP) methods can be used to automate the detection of hyperpartisanship in web-searched articles, focusing on the extraction of linguistic features. We extend our work to test our findings on web-searched articles related to the 2018 midterm elections.
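As a rough sketch of the final step described above, the example below applies a hyperpartisanship classifier to web-searched articles and reports the fraction flagged. It uses a simple TF-IDF plus logistic regression stand-in trained on invented toy texts; it is not the SemEval Task 4 model or data used in this work.

```python
# Hypothetical stand-in classifier (not the thesis model): train on toy labeled texts,
# then report the share of web-searched articles flagged as hyperpartisan.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["they are destroying our country, wake up!",
               "the committee reviewed the proposed amendment",
               "the radical left/right will ruin everything",
               "officials reported turnout figures on tuesday"]
train_labels = [1, 0, 1, 0]  # 1 = hyperpartisan, 0 = mainstream (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

searched_articles = ["turnout figures were reported tuesday",
                     "wake up, they will ruin everything we love"]
preds = clf.predict(searched_articles)
print(f"{preds.mean():.0%} of retrieved articles flagged as hyperpartisan")
```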
245

Narrative Generation to Support Causal Exploration of Directed Graphs

Choudhry, Arjun 02 June 2020 (has links)
Causal graphs are a useful notation to represent the interplay between actors as well as the polarity and strength of the relationships they share. They are used extensively in educational, professional, and industrial contexts to simulate different scenarios, validate behavioral aspects, visualize the connections between different processes, and explore the adverse effects of changing certain nodes. However, as the size of a causal graph increases, interpreting it also becomes increasingly difficult. In such cases, new analytical tools are required to enhance the user's comprehension of the graph, both in terms of correctness and speed. To this end, this thesis introduces 1) a system that allows for causal exploration of directed graphs, while enabling the user to see the effect of interventions on the target nodes, 2) the use of natural language generation techniques to create a coherent passage explaining the propagation effects, and 3) results of an expert user study validating the efficacy of the narratives in enhancing the user's understanding of the causal graphs. Overall, the system aims to enhance user experience and promote further causal exploration. / Master of Science / Narrative generation is the art of creating coherent snippets of text that cumulatively describe a succession of events, played across a period of time. These goals of narrative generation are also shared by causal graphs – models that encapsulate inferences between the nodes through the strength and polarity of the connecting edges. Causal graphs are a useful mechanism to visualize changes propagating amongst nodes in the system. However, as the graph starts addressing real-world actors and their interactions, it becomes increasingly difficult to understand causal inferences between distant nodes, especially if the graph is cyclic. Moreover, if the value of more than a single node is altered and the cumulative effect of the change is to be perceived on a set of target nodes, the task becomes extremely difficult for the human eye. This thesis attempts to alleviate this problem by generating dynamic narratives detailing the effect of one or more interventions on one or more target nodes, incorporating time-series analysis, Wikification, and spike detection. Moreover, the narrative enhances the user's understanding of the change propagation occurring in the system. The efficacy of the narrative was further corroborated by the results of user studies, which concluded that the presence of the narrative aids the user's confidence level, correctness, and speed while exploring the causal network.
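To illustrate the kind of change propagation that such narratives describe, here is a minimal, hypothetical sketch of propagating an intervention through a weighted, possibly cyclic, directed causal graph. The nodes, weights, and damping scheme are invented for illustration and are not the system built in this thesis.

```python
# A minimal, hypothetical sketch of intervention propagation on a weighted directed
# (possibly cyclic) causal graph: each edge weight encodes polarity and strength, and
# changes are propagated iteratively with damping. This illustrates the idea only,
# not the system or narrative-generation pipeline built in the thesis.
edges = {  # (source, target): signed weight
    ("funding", "research_output"): 0.8,
    ("research_output", "reputation"): 0.6,
    ("reputation", "funding"): 0.4,        # a cycle
    ("funding", "tuition"): -0.3,
}

def propagate(interventions, steps=20, damping=0.5):
    change = dict(interventions)  # e.g. {"funding": +1.0}
    for _ in range(steps):
        nxt = dict(interventions)
        for (src, dst), w in edges.items():
            nxt[dst] = nxt.get(dst, 0.0) + damping * w * change.get(src, 0.0)
        change = nxt
    return change

effects = propagate({"funding": 1.0})
for node, delta in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{node:16s} {delta:+.2f}")  # a narrative generator could verbalize these deltas
```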
246

Measuring the Functionality of Amazon Alexa and Google Home Applications

Wang, Jiamin 01 1900 (has links)
Voice Personal Assistant (VPA) is a software agent that can interpret the user's voice commands and respond with appropriate information or action. Users can operate the VPA by voice to complete multiple tasks, such as reading messages, ordering coffee, sending an email, checking the news, and so on. Although this new technique brings in interesting and useful features, it also poses new privacy and security risks. Current research has focused on proof-of-concept attacks, pointing out potential ways of launching them, e.g., crafting hidden voice commands to trigger malicious actions without the user noticing, or fooling the VPA into invoking the wrong applications. However, the lack of a comprehensive understanding of the functionality of skills and their commands prevents us from analyzing the potential threats of these attacks systematically. In this project, we developed convolutional neural networks with active learning and a keyword-based approach to investigate commands according to their capability (information retrieval or action injection) and sensitivity (sensitive or nonsensitive). Through these two levels of analysis, we will provide a complete view of VPA skills and their susceptibility to the existing attacks. / M.S. / Voice Personal Assistant (VPA) is a software agent that can interpret users' voice commands and respond with appropriate information or action. The current popular VPAs are Amazon Alexa, Google Home, Apple Siri, and Microsoft Cortana. Developers can build and publish third-party applications, called skills in Amazon Alexa and actions in Google Home, on the VPA server. Users simply "talk" to the VPA devices to complete different tasks, like reading messages, ordering coffee, sending an email, checking the news, and so on. Although this new technique brings in interesting and useful features, it also poses new potential security threats. Recent research has revealed that vulnerabilities exist in the VPA ecosystems. Users can incorrectly invoke a malicious skill whose name has a similar pronunciation to the user-intended skill. Inaudible voice commands can trigger unintended actions without users noticing. All of the current research has focused on the potential ways of launching the attacks. The lack of a comprehensive understanding of the functionality of the skills and their commands prevents us from analyzing the potential consequences of these attacks systematically. In this project, we carried out an extensive analysis of third-party applications from Amazon Alexa and Google Home to characterize the attack surfaces. First, we developed a convolutional neural network with an active learning framework to categorize the commands according to their capability, whether they are information retrieval or action injection commands. Second, we employed a keyword-based approach to classify the commands into sensitive and nonsensitive classes. Through these two levels of analysis, we will provide a complete view of VPA skills' functionality and their susceptibility to the existing attacks.
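The keyword-based sensitivity analysis described above can be illustrated with the hedged sketch below. The keyword list and example commands are invented stand-ins, and the CNN-with-active-learning capability classifier is not shown.

```python
# A simplified, hypothetical sketch of keyword-based sensitivity labeling:
# flag a skill command as "sensitive" when it mentions terms tied to money, security,
# or personal data. The keyword list and commands are illustrative, not the lexicon
# or dataset used in the project.
SENSITIVE_KEYWORDS = {"password", "unlock", "pay", "payment", "order", "credit",
                      "address", "email", "bank", "pin", "purchase"}

def sensitivity(command: str) -> str:
    tokens = set(command.lower().split())
    return "sensitive" if tokens & SENSITIVE_KEYWORDS else "nonsensitive"

commands = ["read my latest email", "what's the weather today",
            "unlock the front door", "order a large coffee"]
for c in commands:
    print(f"{c!r:32} -> {sensitivity(c)}")
```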
247

Describing Trail Cultures through Studying Trail Stakeholders and Analyzing their Tweets

Bartolome, Abigail Joy 08 August 2018 (has links)
While many people enjoy hiking as a weekend activity, to many outdoor enthusiasts there is a hiking culture with which they feel affiliated. However, the way that these cultures interact with each other is still unclear. Exploring these different cultures and understanding how they relate to each other can help in engaging stakeholders of the trail. This is an important step toward finding ways to encourage environmentally friendly outdoor recreation practices and developing hiker-approved (and environmentally conscious) technologies to use on the trail. We explored these cultures by analyzing an extensive collection of tweets (over 1.5 million). We used topic modeling to identify the topics described by the communities of the Triple Crown trails. We labeled training data for a classifier that identifies tweets relating to depreciative behaviors on the trail. Then, we compared the distribution of tweets across various depreciative trail behaviors to those of corresponding blog posts in order to see how tweets reflected cultures in comparison with blog posts. To harness metadata beyond the text of the tweets, we experimented with visualization techniques. We combined those efforts with ethnographic studies of hikers and conservancy organizations to produce this exploration of trail cultures. In this thesis, we show that through the use of natural language processing, we can identify cultural differences between trail communities. We identify the most significantly discussed forms of trail depreciation, which is helpful to conservation organizations so that they can more effectively communicate which Leave No Trace practices hikers should put extra effort into. / Master of Science / In a memoir of her hike on the Pacific Crest Trail, Wild, Cheryl Strayed said to a reporter in an amused tone, “I’m not a hobo, I’m a long-distance hiker”. While many people enjoy hiking as a weekend activity, to many outdoor enthusiasts there is a hiking culture with which they feel affiliated. There are cultures of trail conservation, and cultures of trail depreciation. There are cultures of long-distance hiking, and there are cultures of day hiking and weekend warrior hiking. There are also cultures across different hiking trails—where the hikers of one trail have different sets of values and behaviors than for another trail. However, the way that these cultures interact with each other is still unclear. Exploring these different cultures and understanding how they relate to each other can help in engaging stakeholders of the trail. This is an important step toward finding ways to encourage environmentally friendly outdoor recreation practices and developing hiker-approved (and environmentally conscious) technologies to use on the trail. We decided to explore these cultures by analyzing an extensive collection of tweets (over 1.5 million). We combined those efforts with ethnographic-style studies of conservancy organizations and avid hikers to produce this exploration of trail cultures.
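One of the comparisons described above, contrasting how depreciative behaviors are discussed in tweets versus blog posts, can be sketched roughly as follows. The behavior categories, keywords, and example texts are invented; the thesis used labeled training data and a classifier rather than this simple keyword matcher.

```python
# Illustrative sketch: label texts with depreciative-behavior categories via keyword
# matching and compare the category distribution in tweets versus blog posts.
# Categories, keywords, and texts are invented stand-ins for the thesis's data.
from collections import Counter

CATEGORIES = {
    "littering": {"trash", "litter", "wrappers"},
    "graffiti": {"graffiti", "carving", "tagging"},
    "off_trail": {"shortcut", "off-trail", "switchback"},
}

def label(text):
    tokens = set(text.lower().split())
    return [cat for cat, kws in CATEGORIES.items() if tokens & kws]

def distribution(texts):
    counts = Counter(cat for t in texts for cat in label(t))
    total = sum(counts.values()) or 1
    return {cat: counts.get(cat, 0) / total for cat in CATEGORIES}

tweets = ["so much trash near the shelter", "someone left graffiti on the summit sign"]
blogs = ["please don't cut the switchback shortcut", "pack out your wrappers and litter"]
print("tweets:", distribution(tweets))
print("blogs: ", distribution(blogs))
```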
248

Learning with Limited Labeled Data: Techniques and Applications

Lei, Shuo 11 October 2023 (has links)
Recent advances in large neural network-style models have demonstrated great performance in various applications, such as image generation, question answering, and audio classification. However, these deep and high-capacity models require a large amount of labeled data to function properly, rendering them inapplicable in many real-world scenarios. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to solve the following research questions: (1) How to learn novel classes with limited labeled data, (2) How to adapt a large pre-trained model to the target domain if only unlabeled data is available, (3) How to boost the performance of the few-shot learning model with unlabeled data, and (4) How to utilize limited labeled data to learn new classes without the training data in the same domain. First, we study few-shot learning in text classification tasks. Meta-learning is becoming a popular approach for addressing few-shot text classification and has achieved state-of-the-art performance. However, the performance of existing approaches heavily depends on the inter-class variance of the support set. To address this problem, we propose a TART network for few-shot text classification. The model enhances generalization by transforming the class prototypes to per-class fixed reference points in task-adaptive metric spaces. In addition, we design a novel discriminative reference regularization to maximize divergence between transformed prototypes in task-adaptive metric spaces to improve performance further. For the second problem, we focus on self-learning in the cross-lingual transfer task. Our goal here is to develop a framework that enables a pretrained cross-lingual model to continue learning from a large amount of unlabeled data. Existing self-learning methods in cross-lingual transfer tasks suffer from the large number of incorrectly pseudo-labeled samples used in the training phase. We first design an uncertainty-aware cross-lingual transfer framework with pseudo-partial-labels. We also propose a novel pseudo-partial-label estimation method that considers prediction confidences and limits the number of candidate classes. Next, to boost the performance of the few-shot learning model with unlabeled data, we propose a semi-supervised approach for the few-shot semantic segmentation task. Existing solutions for few-shot semantic segmentation cannot easily be applied to utilize image-level weak annotations. We propose a class-prototype augmentation method to enrich the prototype representation by utilizing a few image-level annotations, achieving superior performance in one-/multi-way and weak annotation settings. We also design a robust strategy with soft-masked average pooling to handle the noise in image-level annotations, which considers the prediction uncertainty and employs a task-specific threshold to mask the distraction. Finally, we study cross-domain few-shot learning in the semantic segmentation task. Most existing few-shot segmentation methods consider a setting where base classes are drawn from the same domain as the new classes. Nevertheless, gathering enough training data for meta-learning is either unattainable or impractical in many applications. We extend few-shot semantic segmentation to a new task, called Cross-Domain Few-Shot Semantic Segmentation (CD-FSS), which aims to generalize the meta-knowledge from domains with sufficient training labels to low-resource domains.
Then, we establish a new benchmark for the CD-FSS task and evaluate both representative few-shot segmentation methods and transfer learning based methods on the proposed benchmark. We then propose a novel Pyramid-Anchor-Transformation-based few-shot segmentation network (PATNet), in which domain-specific features are transformed into domain-agnostic ones so that downstream segmentation modules can quickly adapt to unseen domains. / Doctor of Philosophy / Nowadays, deep learning techniques play a crucial role in our everyday existence. In addition, they are crucial to the success of many e-commerce and local businesses for enhancing data analytics and decision-making. Notable applications include intelligent transportation, intelligent healthcare, the generation of natural language, and intrusion detection, among others. To achieve reasonable performance on a new task, these deep and high-capacity models require thousands of labeled examples, which increases the data collection effort and computation costs associated with training a model. Moreover, in many disciplines, it might be difficult or even impossible to obtain data due to concerns such as privacy and safety. This dissertation focuses on learning with limited labeled data in natural language processing and computer vision tasks. To recognize novel classes with a few examples in text classification tasks, we develop a deep learning-based model that can capture both cross-task transferable knowledge and task-specific features. We also build an uncertainty-aware self-learning framework and a semi-supervised few-shot learning method, which allow us to boost the pre-trained model with easily accessible unlabeled data. In addition, we propose a cross-domain few-shot semantic segmentation method to generalize the model to different domains with a few examples. By handling these unique challenges in learning with limited labeled data and developing suitable approaches, we hope to improve the efficiency and generalization of deep learning methods in the real world.
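As background for the prototype-based few-shot methods discussed above, the sketch below shows the basic nearest-prototype classification step common to this family of approaches. It uses random vectors as stand-ins for encoder embeddings and omits the task-adaptive transformations, regularization, and segmentation-specific components proposed in the dissertation.

```python
# A minimal sketch of the prototype-based few-shot classification idea: average the
# support embeddings per class into prototypes and assign each query to the nearest
# prototype. Embeddings here are random stand-ins for encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
n_way, k_shot, dim = 3, 5, 16                      # a 3-way, 5-shot episode
support = rng.normal(size=(n_way, k_shot, dim))    # [class, shot, embedding]
queries = rng.normal(size=(4, dim))                # 4 query embeddings

prototypes = support.mean(axis=1)                  # one prototype per class
dists = np.linalg.norm(queries[:, None, :] - prototypes[None, :, :], axis=-1)
pred = dists.argmin(axis=1)                        # nearest-prototype label per query
print(pred)
```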
249

Andromeda in Education: Studies on Student Collaboration and Insight Generation with Interactive Dimensionality Reduction

Taylor, Mia Rachel 04 October 2022 (has links)
Andromeda is an interactive visualization tool that projects high-dimensional data into a scatterplot-like visualization using Weighted Multidimensional Scaling (WMDS). The visualization can be explored through surface-level interaction (viewing data values), parametric interaction (altering underlying parameterizations), and observation-level interaction (directly interacting with projected points). This thesis presents analyses on the collaborative utility of Andromeda in a middle school class and the insights college-level students generate when using Andromeda. The first study discusses how a middle school class collaboratively used Andromeda to explore and compare their engineering designs. The students analyzed their designs, represented as high-dimensional data, as a class. This study shows promise for introducing collaborative data analysis to middle school students in conjunction with other technical concepts such as the engineering design process. Participants in the study on college-level students were given a version of Andromeda, with access to different interactions, and were asked to generate insights on a dataset. By applying a novel visualization evaluation methodology to students' natural language insights, the results of this study indicate that students use different vocabulary supported by the interactions available to them, but not equally. The implications, as well as limitations, of these two studies are further discussed. / Master of Science / Data is often high-dimensional. A good example of this is a spreadsheet with many columns. Visualizing high-dimensional data is a difficult task because it must capture all information in 2 or 3 dimensions. Andromeda is a tool that can project high-dimensional data into a scatterplot-like visualization. Data points that are considered similar are plotted near each other and vice versa. Users can alter how important certain parts of the data are to the plotting algorithm as well as move points directly to update the display based on the user-specified layout. These interactions within Andromeda allow data analysts to explore high-dimensional data based on their personal sensemaking processes. As high-dimensional thinking and exploratory data analysis are being introduced into more classrooms, it is important to understand the ways in which students analyze high-dimensional data. To address this, this thesis presents two studies. The first study discusses how a middle school class used Andromeda for their engineering design assignments. The results indicate that using Andromeda in a collaborative way enriched the students' learning experience. The second study analyzes how college-level students, when given access to different interaction types in Andromeda, generate insights into a dataset. Students use different vocabulary supported by the interactions available to them, but not equally. The implications, as well as limitations, of these two studies are further discussed.
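To illustrate the role of dimension weights in WMDS-style projection (the parametric interaction mentioned above), the sketch below computes weighted high-dimensional dissimilarities and embeds them in 2-D with scikit-learn's generic MDS. It is an assumption-laden stand-in, not Andromeda's implementation.

```python
# A rough sketch of the idea behind Weighted Multidimensional Scaling (WMDS):
# per-dimension weights change the high-dimensional dissimilarities, which changes
# the 2-D layout. Generic MDS on a weighted distance matrix is used as a stand-in.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 6))                   # 10 samples, 6 dimensions
w = np.array([1.0, 1.0, 0.1, 0.1, 0.1, 0.1])   # user upweights the first two dimensions
w = w / w.sum()

diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((w * diff ** 2).sum(axis=-1))      # weighted Euclidean dissimilarities

layout = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(layout[:3])                              # 2-D coordinates for the first three points
```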
250

Segmenting Electronic Theses and Dissertations By Chapters

Manzoor, Javaid Akbar 18 January 2023 (has links)
Master of Science / Electronic theses and dissertations (ETDs) are structured documents in which chapters are major components. There is a lack of any repository that contains chapter boundary details alongside these structured documents. Revealing these details of the documents can help increase accessibility. This research explores the manipulation of ETDs marked up using LaTeX to generate chapter boundaries. We use this to create a data set of 1,459 ETDs and their chapter boundaries. Additionally, for the task of automatic segmentation of unseen documents, we prototype three deep learning models that are trained using this data set. We hope to encourage researchers to incorporate LaTeX manipulation techniques to create similar data sets.
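A simplified, hypothetical sketch of the LaTeX manipulation idea is shown below: scanning a .tex source for \chapter commands and recording where each chapter begins. The real pipeline must handle much messier markup (multiple files, starred chapters, front matter), so this only illustrates the core idea.

```python
# Hypothetical sketch: find chapter boundaries by locating \chapter{...} commands in a
# LaTeX source string and recording the line on which each one starts.
import re

tex = r"""
\chapter{Introduction}
Some introductory text.
\chapter{Methodology}
Details of the approach.
\chapter{Results}
Findings and discussion.
"""

boundaries = []
for m in re.finditer(r"\\chapter\*?\{([^}]*)\}", tex):
    line_no = tex.count("\n", 0, m.start()) + 1   # line numbers relative to the string above
    boundaries.append((line_no, m.group(1)))

for line_no, title in boundaries:
    print(f"line {line_no}: {title}")
```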
