  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
431

Topicalization in Malagasy: Effects of Teaching Malagasy as a Topic Language

Workman, Jeremy D. 30 November 2009 (has links) (PDF)
This study discusses teaching Malagasy as a second language. Malagasy is the native language spoken on the island of Madagascar. Traditionally, Malagasy has been taught as a language that, like English, uses active and passive voice constructions. However, most native-English-speaking students struggle to produce native-like utterances using non-active voice constructions in Malagasy. Recent studies have suggested that Malagasy more closely resembles Germanic V2 languages than English (Pearson 2005, Hyams et al. 2006), which might explain why students taught Malagasy as an English-like language struggle. This study compares the relative effectiveness of teaching Malagasy as a V2 language with topicalized triggers against traditional approaches, where the trigger is treated as an English-like subject. The study is based on data gathered from two groups of beginning Malagasy students at the LDS Missionary Training Center in Provo, Utah. One group was taught according to traditional methods; the other was taught the topic/voicing theory set forth by Pearson (2005). There was a general trend of improvement from the traditionally taught group to the group taught topicalization.
432

Pro-Drop and Word-Order Variation in Brazilian Portuguese: A Corpus Study

Smith, Stewart Daniel 03 July 2013 (has links) (PDF)
The present study examines certain syntactic properties of the Brazilian variety of Portuguese (BP): 1) BP is a pro-drop language with instances of both null subjects and covert objects, and 2) BP exhibits several possible word orders. To determine the frequency of pro-drop and word-order variation, the CDP (The Portuguese Corpus) was used to provide samples of transitive main clauses, which were then categorized based on whether or not they had null subjects and covert objects, and also according to word order. In addition to providing samples, the corpus allowed for the comparison of four different registers of BP: academic, newspaper, fiction, and oral. The results demonstrated that null subjects are much more common than covert objects (29.4% and 2.3% respectively) and that register significantly affected the frequency of pro-drop, with oral having the highest rate of pro-drop and newspaper the lowest. For word order, SVO was most common at 95.1%, with other variations occurring too rarely to reliably determine statistical significance. Unlike pro-drop, word order was not affected by register. Word-order variations were not random, however, but were determined by topic and focus, with old information (topic) generally occurring preverbally and new information (focus) generally occurring in the most embedded position. That this study effectively examined these syntactic features is significant, as most Portuguese syntactic research prior to the present study was specific to European Portuguese. The present study demonstrated a new methodology successfully applied to a different dialect; more than that, it demonstrated that a more empirical, data-driven approach to syntactic research is both possible and valuable, justifying the creation and use of large corpora for this type of research.
433

Topic discovery and document similarity via pre-trained word embeddings

Chen, Simin January 2018 (has links)
Throughout history, humans have generated an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus and then compute the distance between two document vectors as the document similarity. In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power. Our model is efficient in terms of average time complexity, and the nBTE representation is interpretable, as it allows for topic discovery within the document. On three labeled public data sets, our model achieved k-nearest neighbor classification accuracy comparable to five state-of-the-art baseline models. Furthermore, from these three data sets we derived four multi-topic data sets in which each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis: can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate document similarity?
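The soft cosine measure named in this abstract generalizes cosine similarity with a topic-topic similarity matrix. The following numpy sketch is a minimal illustration of that measure on toy values, not the thesis's implementation; the "nBTE-style" vectors and the similarity matrix are invented for the example.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine similarity between two topic-weight vectors a and b,
    given a topic-topic similarity matrix S (S[i, j] = similarity of
    topic i to topic j; the identity matrix recovers ordinary cosine)."""
    num = a @ S @ b
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return num / den

# Two toy vectors: each entry is the normalized weight of one topic.
a = np.array([0.7, 0.3, 0.0])
b = np.array([0.0, 0.4, 0.6])

# With the identity matrix, soft cosine equals plain cosine similarity.
plain = soft_cosine(a, b, np.eye(3))

# A matrix that treats topics 0 and 2 as related raises the score.
S = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]])
related = soft_cosine(a, b, S)
print(plain, related)
```

Because the two documents emphasize different but related topics, the soft score exceeds the plain cosine score here.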
434

Exploring the Potential of Twitter Data and Natural Language Processing Techniques to Understand the Usage of Parks in Stockholm

Norsten, Theodor January 2020 (has links)
Traditional methods used to investigate the usage of parks consist of questionnaires, a very time- and resource-consuming approach. Today more than four billion people use some form of social media platform daily. This has led to huge amounts of data being generated every day through various social media platforms and has created a potential new source for retrieving large amounts of data. This report investigates a modern approach that uses Natural Language Processing on Twitter data to understand how parks in Stockholm are being used. Natural Language Processing (NLP) is an area within artificial intelligence that refers to the process of reading, analyzing, and understanding large amounts of text data, and it is considered to be the future of understanding unstructured text. Twitter data were obtained through Twitter's open API. Data from three parks in Stockholm were collected for the period 2015-2019. Three analyses were then performed: temporal analysis, sentiment analysis, and topic modeling. The results show that it is possible to understand what attitudes and activities are associated with visiting parks by applying NLP to social media data. It is clear that sentiment analysis is a difficult task for computers to solve and is still at an early stage of development, and the results of the sentiment analysis indicate some uncertainties. To achieve more reliable results, the analysis would need much more data, more thorough cleaning methods, and a basis in English-language tweets. One significant conclusion is that people's attitudes and activities linked to each park clearly correlate with the attributes that park offers. Another clear pattern is that park usage peaks significantly during holiday celebrations, and positive sentiment is the emotion most strongly linked with park visits. Findings suggest that future studies focus on combining the approach in this report with geospatial data from a social media platform where users share their geolocation to a greater extent.
435

Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities

Wang, Xuerui 01 May 2009 (has links)
The abundance of data in the information age poses an immense challenge: how to perform large-scale inference to understand and utilize this overwhelming amount of information. Such techniques are of tremendous intellectual significance and practical impact. As part of this grand challenge, the goal of my Ph.D. thesis is to develop effective and efficient statistical topic models for massive text collections by incorporating extra information from other modalities in addition to the text itself. Text documents are not just text, and different kinds of additional information are naturally interleaved with text. Most previous work, however, pays attention to only one modality at a time and ignores the others. In my thesis, I present a series of probabilistic topic models showing how we can bridge multiple modalities of information, in a unified fashion, for various tasks. Interestingly, joint inference over multiple modalities leads to many findings that cannot be discovered from any one modality alone, as briefly illustrated below. Email is pervasive nowadays. Much previous work in natural language processing modeled text using latent topics while ignoring the social network; on the other hand, social network research has mainly dealt with the existence of links between entities without taking into consideration the language content or topics on those links. The author-recipient-topic (ART) model, by contrast, steers the discovery of topics according to the relationships between people, and learns topic distributions based on the direction-sensitive messages sent between entities. However, the ART model does not explicitly identify groups formed by entities in the network, and previous work in social network analysis ignores the fact that different groupings arise for different topics. The group-topic (GT) model, a probabilistic generative model of entity relationships and textual attributes, simultaneously discovers groups among the entities and topics among the corresponding text. Many large datasets do not have static latent structures; they are instead dynamic. The topics-over-time (TOT) model explicitly models time as an observed continuous variable. This allows TOT to see long-range dependencies in time and also helps avoid a Markov model's risk of inappropriately dividing a topic in two when there is a brief gap in its appearance; treating time as continuous also avoids the difficulties of discretization. Most topic models, including all of the above, rely on the bag-of-words assumption, yet word order and phrases are often critical to capturing the meaning of text. The topical n-grams (TNG) model discovers topics as well as meaningful topical phrases simultaneously. In summary, we believe these models are clear evidence that we can better understand and utilize massive text collections when additional modalities are considered and modeled jointly with text.
436

A Geometric Framework for Transfer Learning Using Manifold Alignment

Wang, Chang 01 September 2010 (has links)
Many machine learning problems involve dealing with a large amount of high-dimensional data across diverse domains. In addition, annotating or labeling the data is expensive as it involves significant human effort. This dissertation explores a joint solution to both these problems by exploiting the property that high-dimensional data in real-world application domains often lies on a lower-dimensional structure, whose geometry can be modeled as a graph or manifold. In particular, we propose a set of novel manifold-alignment based approaches for transfer learning. The proposed approaches transfer knowledge across different domains by finding low-dimensional embeddings of the datasets to a common latent space, which simultaneously match corresponding instances while preserving local or global geometry of each input dataset. We develop a novel two-step transfer learning method called Procrustes alignment. Procrustes alignment first maps the datasets to low-dimensional latent spaces reflecting their intrinsic geometries and then removes the translational, rotational and scaling components from one set so that the optimal alignment between the two sets can be achieved. This approach can preserve either global geometry or local geometry depending on the dimensionality reduction approach used in the first step. We propose a general one-step manifold alignment framework called manifold projections that can find alignments, both across instances as well as across features, while preserving local domain geometry. We develop and mathematically analyze several extensions of this framework to more challenging situations, including (1) when no correspondences across domains are given; (2) when the global geometry of each input domain needs to be respected; (3) when label information rather than correspondence information is available. A final contribution of this thesis is the study of multiscale methods for manifold alignment. 
Multiscale alignment automatically generates alignment results at different levels by discovering the shared intrinsic multilevel structures of the given datasets, providing a common representation across all input datasets.
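The two-step alignment described above, first centering and rescaling each embedding and then solving for the optimal rotation, is classic orthogonal Procrustes. The numpy sketch below illustrates the generic technique on toy data; it is not the dissertation's code, and the point sets are invented.

```python
import numpy as np

def procrustes_align(X, Y):
    """Align point set Y to X by removing translation, scale, and rotation
    (orthogonal Procrustes). Rows are corresponding instances in two
    low-dimensional embeddings."""
    # Remove translation: center both sets.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Remove scale: normalize each set to unit Frobenius norm.
    Xc /= np.linalg.norm(Xc)
    Yc /= np.linalg.norm(Yc)
    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Yc.T @ Xc)
    R = U @ Vt
    return Xc, Yc @ R

# Toy check: Y is X rotated 90 degrees, scaled by 3, and shifted.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
theta = np.pi / 2
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = 3.0 * X @ rot + np.array([5.0, -2.0])

Xa, Ya = procrustes_align(X, Y)
print(np.abs(Xa - Ya).max())  # near zero: the two sets coincide after alignment
```

Because Y differs from X only by translation, scale, and rotation, the residual after alignment is numerically zero; with real embeddings from two domains, the residual measures how well the geometries match.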
437

MULTI-ATTRIBUTE AND TEMPORAL ANALYSIS OF PRODUCT REVIEWS USING TOPIC MODELLING AND SENTIMENT ANALYSIS

Meet Tusharbhai Suthar (14232623) 08 December 2022 (has links)
Online reviews, along with photographs and one-to-five star ratings, are frequently used to judge a product's quality before purchase. This research addresses two distinct problems observed in review systems. First, because a product can accumulate thousands of reviews, the different characteristics of customer evaluations, such as consumer sentiment, cannot be understood by manually reading only a few of them. Second, it is extremely hard to understand from these reviews how sentiment and other important product aspects change over the years (temporal analysis). To address these problems, the study focused on two main research parts. Part one examined how topic modelling and sentiment analysis can work together to give a deeper, attribute-based understanding of product reviews. Part two compared different topic modelling approaches to evaluate the performance and advantages of emerging NLP models. For this purpose, a dataset consisting of 469 publicly accessible Amazon reviews of the Kindle E-reader and 15,000 reviews of iPhone products was used for sentiment analysis and topic modelling. The Latent Dirichlet Allocation (LDA) and BERTopic topic models were used to discover the diverse topics of concern, and sentiment analysis was carried out to better understand each topic's positive and negative tones. Topic analysis of Kindle user reviews revealed the following major themes: (a) leisure consumption, (b) utility as a gift, (c) pricing, (d) parental control, (e) reliability and durability, and (f) charging. While the main themes in the iPhone reviews depended on the model and year of the device, some themes were consistent across all iPhone models, including (a) Apple vs. Android, (b) utility as a gift, and (c) service. The study's approach can be used to analyze customer reviews for any product, and its results provide a deeper understanding of a product's strengths and weaknesses based on a comprehensive analysis of user feedback, useful for product makers, retailers, e-commerce platforms, and consumers.
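A toy stand-in for the kind of pipeline this abstract describes: bucket reviews by topic, then score each bucket's tone. The keyword lists and the tiny lexicon below are illustrative placeholders for real LDA/BERTopic output and a trained sentiment model, not anything from the study.

```python
# Hypothetical topic keyword sets, standing in for LDA/BERTopic top words.
TOPIC_KEYWORDS = {
    "pricing": {"price", "cheap", "expensive", "cost"},
    "charging": {"charge", "battery", "charger"},
}
# Hypothetical sentiment lexicon: +1 positive, -1 negative.
LEXICON = {"great": 1, "love": 1, "cheap": 1,
           "slow": -1, "broken": -1, "expensive": -1}

def topic_sentiment(reviews):
    """Return {topic: average sentiment of reviews mentioning that topic}."""
    scores = {t: [] for t in TOPIC_KEYWORDS}
    for review in reviews:
        words = review.lower().split()
        s = sum(LEXICON.get(w, 0) for w in words)  # crude review-level score
        for topic, kws in TOPIC_KEYWORDS.items():
            if kws & set(words):  # review mentions this topic
                scores[topic].append(s)
    return {t: sum(v) / len(v) for t, v in scores.items() if v}

reviews = [
    "great price and cheap to run",
    "charger broken and battery slow",
    "love the battery",
]
print(topic_sentiment(reviews))
```

The output pairs each topic with an average tone, which is the shape of result (theme plus positive/negative tendency) that the study's much richer models produce.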
438

A Data-Driven Approach for Incident Handling in DevOps

Annadata, Lakshmi Ashritha January 2023 (has links)
Background: Maintaining system reliability and customer satisfaction in a DevOps environment requires effective incident management. In the modern day, due to increasing system complexity, several incidents occur daily. Incident prioritization and resolution are essential to manage these incidents and lessen their impact on business continuity. Prioritization of incidents, estimation of recovery time objective (RTO), and resolution times are traditionally subjective processes that rely more on the DevOps team’s competence. However, as the volume of incidents rises, it becomes increasingly challenging to handle them effectively.  Objectives: This thesis aims to develop an approach that prioritizes incidents and estimates the corresponding resolution times and RTO values leveraging machine learning. The objective is to provide an effective solution to streamline DevOps activities. To verify the performance of our solution, an evaluation is later carried out by the users in a large organization (Ericsson).  Methods: The methodology used for this thesis is design science methodology. It starts with the problem identification phase, where a rapid literature review is done to lay the groundwork for the development of the solution. Cross-Industry Standard Process for Data Mining (CRISP-DM) is carried out later in the development phase. In the evaluation phase, a static validation is carried out in a DevOps environment to collect user feedback on the tool’s usability and feasibility.  Results:  According to the results, the tool helps the DevOps team prioritize incidents and determine the resolution time and RTO. Based on the team’s feedback, 84% of participants agree that the tool is helpful, and 76% agree that the tool is easy to use and understand. 
In evaluating the tool's performance on the three metrics chosen for estimating priority, it averaged 93% accuracy, 78% recall, and an 87% F1 score across all four priority levels, and the BERT model's accuracy for estimating the resolution-time range was 88%. Hence, the tool can be expected to speed up incident response and decrease resolution time.  Conclusions: The tool's validation and implementation indicate that it has the potential to increase system reliability and the effectiveness of incident management in a DevOps setting. Prioritizing incidents and predicting resolution-time ranges based on impact and urgency can enable the DevOps team to make well-informed decisions. Future work could investigate integrating the tool with third-party DevOps tools, explore guidelines for handling sensitive incident data, and analyze the tool in a live project to obtain further feedback.
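The per-priority metrics this abstract reports (accuracy, recall, F1) are standard one-vs-rest quantities for a multi-class problem. A small self-contained sketch with made-up incident labels, not the thesis's data or tooling:

```python
def priority_metrics(y_true, y_pred, positive):
    """Accuracy, recall, and F1 for one priority level, treating that level
    as the positive class (a one-vs-rest view of the multi-class problem)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1

# Toy incident priorities; "P1" is the level under evaluation.
y_true = ["P1", "P1", "P2", "P3", "P1", "P2"]
y_pred = ["P1", "P2", "P2", "P3", "P1", "P1"]
print(priority_metrics(y_true, y_pred, "P1"))
```

Averaging these per-level scores over all priority levels gives the kind of aggregate figures the abstract reports.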
439

Topic Modeling for Customer Insights : A Comparative Analysis of LDA and BERTopic in Categorizing Customer Calls

Axelborn, Henrik, Berggren, John January 2023 (has links)
Customer calls serve as a valuable source of feedback for financial service providers, potentially containing a wealth of unexplored insights into customer questions and concerns. However, such call data are typically unstructured and challenging to analyze effectively. This thesis project focuses on leveraging Topic Modeling techniques, a sub-field of Natural Language Processing, to extract meaningful customer insights from calls recorded by a European financial service provider. The objective of the study is to compare two widely used Topic Modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, in categorizing and analyzing the content of the calls. By leveraging these algorithms, the thesis aims to provide the company with a comprehensive understanding of customer needs, preferences, and concerns, ultimately facilitating more effective decision-making.  Following a literature review and dataset analysis, with pre-processing to ensure data quality and consistency, the two algorithms are applied to extract latent topics. Their performance is then evaluated using quantitative and qualitative measures: perplexity and coherence scores, as well as the interpretability and usefulness of the resulting topics. The findings contribute to knowledge on Topic Modeling for customer insights and enable the company to improve customer engagement and satisfaction and to tailor its customer strategies.  The results show that LDA outperforms BERTopic in terms of topic quality and business value. Although BERTopic demonstrates slightly better quantitative performance, LDA aligns much better with human interpretation, indicating a stronger ability to capture meaningful and coherent topics within the company's customer call data.
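Coherence scores like those used to compare LDA and BERTopic above reward topics whose top words co-occur in the corpus. The sketch below uses one common formulation (UMass coherence) on a toy corpus of call-like snippets; the documents and topic word lists are invented for illustration.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass topic coherence: sums log((D(w1, w2) + 1) / D(w1)) over ordered
    word pairs, where D counts documents containing the given words. Scores
    closer to zero mean the topic's top words co-occur often, a common
    proxy for interpretability."""
    doc_sets = [set(d.lower().split()) for d in docs]
    def df(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        score += math.log((df(w1, w2) + 1) / df(w1))
    return score

docs = [
    "account balance transfer fee",
    "card payment fee declined",
    "balance transfer between account and card",
    "weather small talk",
]
coherent = umass_coherence(["account", "balance", "transfer"], docs)
incoherent = umass_coherence(["account", "weather", "declined"], docs)
print(coherent, incoherent)
```

A topic built from words that travel together in the corpus scores higher than an arbitrary word set, which is the quantitative signal the thesis pairs with human judgments of topic quality.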
440

Trend Analysis on Artificial Intelligence Patents

Cotra, Aditya Kousik 28 June 2021 (has links)
No description available.
