Global ETD Search

1	Facilitating Corpus Annotation by Improving Annotation Aggregation Felt, Paul L 01 December 2015 (has links) (PDF) Annotated text corpora facilitate the linguistic investigation of language as well as the automation of natural language processing (NLP) tasks. NLP tasks include problems such as spam email detection, grammatical analysis, and identifying mentions of people, places, and events in text. However, constructing high quality annotated corpora can be expensive. Cost can be reduced by employing low-cost internet workers in a practice known as crowdsourcing, but the resulting annotations are often inaccurate, decreasing the usefulness of a corpus. This inaccuracy is typically mitigated by collecting multiple redundant judgments and aggregating them (e.g., via majority vote) to produce high quality consensus answers. We improve the quality of consensus labels inferred from imperfect annotations in a number of ways. We show that transfer learning can be used to derive benefit from out-dated annotations which would typically be discarded. We show that, contrary to popular preference, annotation aggregation models that take a generative data modeling approach tend to outperform those that take a condition approach. We leverage this insight to develop csLDA, a novel annotation aggregation model that improves on the state of the art for a variety of annotation tasks. When data does not permit generative data modeling, we identify a conditional data modeling approach based on vector-space text representations that achieves state-of-the-art results on several unusual semantic annotation tasks. Finally, we identify a family of models capable of aggregating annotation data containing heterogenous annotation types such as label frequencies and labeled features. We present a multiannotator active learning algorithm for this model family that jointly selects an annotator, data items, and annotation type. crowdsourcing corpus annotation semantic embeddings LDA rich prior knowledge Computer Sciences
2	Semantic Structuring Of Digital Documents: Knowledge Graph Generation And Evaluation Luu, Erik E 01 June 2024 (has links) (PDF) In the era of total digitization of documents, navigating vast and heterogeneous data landscapes presents significant challenges for effective information retrieval, both for humans and digital agents. Traditional methods of knowledge organization often struggle to keep pace with evolving user demands, resulting in suboptimal outcomes such as information overload and disorganized data. This thesis presents a case study on a pipeline that leverages principles from cognitive science, graph theory, and semantic computing to generate semantically organized knowledge graphs. By evaluating a combination of different models, methodologies, and algorithms, the pipeline aims to enhance the organization and retrieval of digital documents. The proposed approach focuses on representing documents as vector embeddings, clustering similar documents, and constructing a connected and scalable knowledge graph. This graph not only captures semantic relationships between documents but also ensures efficient traversal and exploration. The practical application of the system is demonstrated in the context of digital libraries and academic research, showcasing its potential to improve information management and discovery. The effectiveness of the pipeline is validated through extensive experiments using contemporary open-source tools. Knowledge Management Semantic Embeddings Knowledge Graphs Documents Artificial Intelligence and Robotics Databases and Information Systems Data Science

Search results

Facilitating Corpus Annotation by Improving Annotation Aggregation

Semantic Structuring Of Digital Documents: Knowledge Graph Generation And Evaluation