11 |
Comparing Text Classification Libraries in Scala and Python: A comparison of precision and recall. Garamvölgyi, Filip; Henning Bruce, August. January 2021 (has links)
In today’s internet era, more text than ever is being uploaded online. The text comes in many forms, such as social media posts, business reviews, and many more. For various reasons, there is an interest in analyzing the uploaded text. For instance, an airline could ask its customers to review the service they have received. The feedback would be collected by asking the customer to leave a review and a score. A common scenario is a review with a good score that contains negative aspects. It is preferable to avoid a situation where the entire review is regarded as positive because of the score even though negative aspects are mentioned. A solution to this would be to analyze each sentence of a review and classify it as negative, neutral, or positive depending on how the sentence is perceived. With the amount of text uploaded today, it is not feasible to analyze text manually. Automatically classifying text by a set of criteria is called text classification. The process of specifically classifying text by how it is perceived is a subcategory of text classification known as sentiment analysis. Positive, neutral, and negative would be the sentiments to classify. The most popular frameworks associated with the implementation of sentiment analyzers are developed in the programming language Python. However, over the years, text classification has increased in popularity. This increase in popularity has caused new frameworks to be developed in new programming languages. Scala is one of the programming languages for which new frameworks have been developed to work with sentiment analysis. However, in comparison to Python, it has fewer available resources: Python has more libraries to work with, more available documentation, and more community support online. There are even fewer resources regarding sentiment analysis in a less common language such as Swedish. The problem is that no one has compared a sentiment analyzer for Swedish text implemented in Scala with one implemented in Python. The purpose of this thesis is to compare the precision and recall of a sentiment analyzer implemented in Scala to one implemented in Python. The goal of this thesis is to increase the knowledge regarding the state of text classification for less common natural languages in Scala. To conduct the study, a qualitative approach with the support of quantitative data was used. Two kinds of sentiment analyzers were implemented in Scala and Python. The first classified text as either positive or negative (binary sentiment analysis); the second also classified text as neutral (multiclass sentiment analysis). To perform the comparative study, the implemented analyzers performed classification on text with known sentiments. The quality of the classifications was measured using their F1-score. The results showed that Python had better precision and recall for both tasks. In the binary task, the difference between the two implementations was not as large. The resources available for Python were more specialized for Swedish and did not seem to be as affected by the small dataset used as the resources in Scala. Scala had an F1-score of 0.78 for binary sentiment analysis and 0.65 for multiclass sentiment analysis. Python had an F1-score of 0.83 for binary sentiment analysis and 0.78 for multiclass sentiment analysis.
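As a minimal illustration of the evaluation described in this abstract (not the thesis’s own code), the following Python sketch computes precision, recall and F1-score for a binary and a multiclass sentiment task with scikit-learn; the labels and predictions are placeholder values.

```python
# Minimal sketch: precision, recall and F1 for binary and multiclass
# (positive/neutral/negative) sentiment. Values below are placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score

# Binary task: 1 = positive, 0 = negative
y_true_bin = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_bin = [1, 0, 1, 0, 0, 1, 1, 0]
print("binary precision:", precision_score(y_true_bin, y_pred_bin))
print("binary recall:   ", recall_score(y_true_bin, y_pred_bin))
print("binary F1:       ", f1_score(y_true_bin, y_pred_bin))

# Multiclass task: positive / neutral / negative, macro-averaged F1
y_true_multi = ["pos", "neu", "neg", "pos", "neg", "neu", "pos", "neg"]
y_pred_multi = ["pos", "neg", "neg", "neu", "neg", "neu", "pos", "pos"]
print("multiclass macro F1:", f1_score(y_true_multi, y_pred_multi, average="macro"))
```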
|
12 |
A Semantics-based User Interface Model for Content Annotation, Authoring and Exploration. Khalili, Ali. 02 February 2015 (has links) (PDF)
The Semantic Web and Linked Data movements, which aim to create, publish and interconnect machine-readable information, have gained traction in recent years.
However, the majority of information is still contained in, and exchanged using, unstructured documents such as Web pages, text documents, images and videos.
Nor can this be expected to change, since text, images and videos are the natural way in which humans interact with information.
Semantic structuring of content on the other hand provides a wide range of advantages compared to unstructured information.
Semantically-enriched documents facilitate information search and retrieval, presentation, integration, reusability, interoperability and personalization.
Looking at the life-cycle of semantic content on the Web of Data, we see considerable progress on the backend side in storing structured content and in linking data and schemata.
Nevertheless, from our point of view, the least developed aspect of the semantic content life-cycle is currently the user-friendly manual and semi-automatic creation of rich semantic content.
In this thesis, we propose a semantics-based user interface model, which aims to reduce the complexity of underlying technologies for semantic enrichment of content by Web users.
By surveying existing tools and approaches for semantic content authoring, we extracted a set of guidelines for designing efficient and effective semantic authoring user interfaces.
We applied these guidelines to devise a semantics-based user interface model called WYSIWYM (What You See Is What You Mean) which enables integrated authoring, visualization and exploration of unstructured and (semi-)structured content.
To assess the applicability of our proposed WYSIWYM model, we incorporated the model into four real-world use cases comprising two general and two domain-specific applications.
These use cases address four aspects of the WYSIWYM implementation:
1) Its integration into existing user interfaces,
2) Utilizing it for lightweight text analytics to incentivize users,
3) Dealing with crowdsourcing of semi-structured e-learning content,
4) Incorporating it for authoring of semantic medical prescriptions.
|
13 |
The Natural Learning Process and Its Implications for Trombone Pedagogy. Reider, Shane Robert. 05 1900 (has links)
This thesis considers the natural learning process as defined by Timothy Gallwey and Daniel Kohut. This learning theory is examined and applied to trombone pedagogy while also considering physiological attributes of trombone performance. A brief synopsis of the history and lineage of the trombone is included in order to understand the current setting of the trombone medium.
|
14 |
Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. Coppola, Gregory Francis. January 2015 (has links)
The development of distributed training strategies for statistical prediction functions is important for applications of machine learning in general, and the development of distributed structured prediction training strategies is important for natural language processing (NLP) in particular. With ever-growing data sets this is, first, because it is easier to increase computational capacity by adding more processor nodes than it is to increase the power of individual processor nodes, and, second, because data sets are often collected and stored in different locations. Iterative parameter mixing (IPM) is a distributed training strategy in which each node in a network of processors optimizes a regularized average loss objective on its own subset of the total available training data, making stochastic (per-example) updates to its own estimate of the optimal weight vector, and communicating with the other nodes by periodically averaging estimates of the optimal vector across the network. This algorithm has been contrasted with a close relative, called here the single-mixture optimization algorithm, in which each node stochastically optimizes an average loss objective on its own subset of the training data, operating in isolation until convergence, at which point the average of the independently created estimates is returned. Recent empirical results have suggested that the IPM strategy produces better models than the single-mixture algorithm, and the results of this thesis add to this picture.

The contributions of this thesis are as follows. The first contribution is to produce and analyze an algorithm for decentralized stochastic optimization of regularized average loss objective functions. This algorithm, which we call the distributed regularized dual averaging algorithm, improves over prior work on distributed dual averaging by providing a simpler algorithm (used in the rest of the thesis), better convergence bounds for the case of regularized average loss functions, and certain technical results that are used in the sequel. The central contribution of this thesis is to give an optimization-theoretic justification for the IPM algorithm. While past work has focused primarily on its empirical test-time performance, we give a novel perspective on this algorithm by showing that, in the context of the distributed dual averaging algorithm, IPM constitutes a convergent optimization algorithm for arbitrary convex functions, while the single-mixture algorithm does not. Experiments indeed confirm that the superior test-time performance of models trained using IPM, compared to single-mixture, correlates with better optimization of the objective value on the training set, a fact not previously reported. Furthermore, our analysis of general non-smooth functions justifies the use of distributed large-margin (support vector machine [SVM]) training of structured predictors, which we show yields better test performance than the IPM perceptron algorithm, the only version of IPM to have previously been given a theoretical justification. Our results confirm that IPM training can reach the same level of test performance as a sequentially trained model and can reach better accuracies when one has a fixed budget of training time. Finally, we use the reduction in training time that distributed training allows to experiment with adding higher-order dependency features to a state-of-the-art phrase-structure parsing model.
We demonstrate that adding these features improves out-of-domain parsing results of even the strongest phrase-structure parsing models, yielding a new state-of-the-art for the popular train-test pairs considered. In addition, we show that a feature-bagging strategy, in which component models are trained separately and later combined, is sometimes necessary to avoid feature under-training and get the best performance out of large feature sets.
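The iterative parameter mixing scheme summarized above can be sketched as a single-process simulation; the following Python/NumPy code is a schematic illustration of the averaging loop only (the data shards, logistic loss and learning rate are placeholder choices, not the thesis’s distributed setup).

```python
# Schematic single-process simulation of iterative parameter mixing (IPM):
# each "node" performs stochastic updates on its own data shard, and the
# per-node weight vectors are averaged after every epoch.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_features, epochs, lr = 4, 20, 10, 0.1

# Placeholder shards of (features, binary labels), one per node.
shards = [(rng.normal(size=(50, n_features)), rng.integers(0, 2, size=50))
          for _ in range(n_nodes)]

w = np.zeros(n_features)                      # shared starting point
for _ in range(epochs):
    local = []
    for X, y in shards:                       # in practice these run in parallel
        w_node = w.copy()
        for x_i, y_i in zip(X, y):            # stochastic (per-example) updates
            margin = x_i @ w_node
            grad = (1.0 / (1.0 + np.exp(-margin)) - y_i) * x_i  # logistic loss gradient
            w_node -= lr * grad
        local.append(w_node)
    w = np.mean(local, axis=0)                # "mix": average across the network
```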
|
15 |
Semantic Search with Information Integration. Xian, Yikun; Zhang, Liu. January 2011 (has links)
Since the first search engine was released in 1993, development has never slowed down, and various search engines have emerged to vie for popularity. However, current traditional search engines like Google and Yahoo! are based on keywords, which leads to imprecise results and information redundancy. A search engine with semantic analysis could be an alternative solution in the future. It is more intelligent and informative, and provides better interaction with users. This thesis discusses semantic search in detail, explains the advantages of semantic search over keyword-based search, and introduces how to integrate semantic analysis with common search engines. At the end of this thesis, there is an example implementation of a simple semantic search engine.
|
16 |
Typesafe NLP pipelines on Spark. Hafner, Simon. 24 February 2015 (has links)
Natural language pipelines consist of various natural language algorithms that use the annotations of a previous algorithm to compute more annotations. These algorithms tend to be expensive in terms of computational power. Therefore it is advantageous to parallelize them in order to reduce the time necessary to analyze a large document collection. The goal of this project was to develop a new framework to encapsulate algorithms such that they may be used as part of a pipeline without any additional work. The framework consists of a custom-built data structure called Slab, which implements type safety and functional transparency in order to integrate into the Scala programming language. Because of this integration, it is possible to use Spark, a MapReduce framework, to parallelize the pipeline on a cluster. To assess the performance of the new framework, a pipeline based on the OpenNLP library was created. An existing pipeline implemented in UIMA, an industry standard for natural language pipeline frameworks, served as a baseline in terms of performance. The pipeline created from the new framework processed the corpus in about half the time.
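The Slab framework itself is written in Scala; purely as a schematic illustration of the parallelization idea, the following PySpark sketch runs a placeholder annotation step over a document collection in parallel.

```python
# Schematic sketch of parallelizing an annotation step over a document
# collection with Spark. The thesis's Slab framework is Scala and adds
# type-safe annotation layers; this only illustrates the parallelism.
from pyspark.sql import SparkSession

def annotate(doc: str) -> dict:
    # Placeholder "pipeline": sentence split, then token annotations.
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    return {"text": doc,
            "sentences": sentences,
            "tokens": [s.split() for s in sentences]}

spark = SparkSession.builder.appName("nlp-pipeline-sketch").getOrCreate()
docs = ["Spark distributes work across a cluster. Each document is annotated independently.",
        "Annotations from one stage feed the next stage."]
annotated = spark.sparkContext.parallelize(docs).map(annotate).collect()
spark.stop()
```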
|
17 |
Advances in Newton-based Barrier Methods for Nonlinear Programming. Wan, Wei. 01 August 2017 (has links)
Nonlinear programming is a very important tool for optimizing many systems in science and engineering. The interior point solver IPOPT has become one of the most popular solvers for NLP because of its high performance. However, certain types of problems are still challenging for IPOPT. This dissertation considers three improvements or extensions to IPOPT to improve performance on several practical classes of problems. Compared to active set solvers, which treat inequalities by identifying active constraints and transforming them to equalities, the interior point method is less robust in the presence of degenerate constraints. Interior point methods require certain regularity conditions on the constraint set for the solution path to exist. Dependent constraints, which violate these regularity conditions, commonly appear in applications such as chemical process models. IPOPT introduces regularization terms to attempt to correct this, but in some cases the regularization terms are either too large or too small and the solver will fail. To deal with these challenges, we present a new structured regularization algorithm, which is able to numerically delete dependent equalities in the KKT matrix. Numerical experiments on hundreds of modified example problems show the effectiveness of this approach, with an average reduction of more than 50% in iterations.

In some contexts, such as online optimization, very fast solution of an NLP is very important. To improve the performance of IPOPT, it is best to take advantage of problem structure. Dynamic optimization problems often need to be solved online in control or state-estimation settings. These problems are very large and have a particular sparse structure. This work investigates the use of parallelization to speed up the NLP solution. Because the KKT factorization is the most expensive step in IPOPT, it is the most important step to parallelize. Several cyclic reduction algorithms are compared for their performance on generic test matrices as well as matrices of the form found in dynamic optimization. The results show that for very large problems, the KKT matrix factorization time can be improved by a factor of four when using eight processors.

Mathematical programs with complementarity constraints (MPCCs) are another challenging class of problems for IPOPT. Several algorithmic modifications are examined to specially handle the difficult complementarity constraints. First, two automatic penalty adjustment approaches are implemented and compared. Next, the use of our structured regularization is tested in combination with the equality reformulation of MPCCs. Then, we propose an altered equality reformulation of MPCCs which effectively removes the degenerate equality or inequality constraints. Using the MacMPEC test library and two applications, we compare the efficiency of our approaches to previous NLP reformulation strategies.
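As a toy illustration of the idea behind detecting dependent equality constraints (not IPOPT’s actual structured regularization), the sketch below flags rows of a constraint Jacobian that do not increase its rank.

```python
# Toy illustration: flag linearly dependent rows of an equality-constraint
# Jacobian by checking whether each row increases the matrix rank.
import numpy as np

def dependent_rows(J, tol=1e-10):
    kept, dependent = [], []
    for i, row in enumerate(J):
        candidate = np.vstack(kept + [row]) if kept else row[None, :]
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(row)          # row adds new information
        else:
            dependent.append(i)       # row is a combination of earlier rows
    return dependent

# Example: the third constraint is the sum of the first two.
J = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 3.0]])
print(dependent_rows(J))   # -> [2]
```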
|
18 |
Automation of Medical Underwriting by Appliance of Machine Learning / AUTOMATISERING AV FÖRSÄKRINGSMEDICINSK UTREDNING GENOM TILLÄMPNING AV MASKININLÄRNING. Rosén, Henrik. January 2020 (has links)
One of the most important fields regarding growth and development for most organizations today is digitalization, or digital transformation. The offering of technological solutions to enhance existing, or create new, processes or products is emerging. That is, it is of great importance that organizations continuously affirm the potential of applying new technical solutions to their existing processes. For example, a well-implemented AI solution for automation of an existing process is likely to contribute considerable business value.

Medical underwriting for individual insurances, which is the process considered in this project, is all about risk assessment based on the individual’s medical record. Such a task appears well suited for automation by a machine learning based application and would thereby contribute substantial business value. However, to properly replace a manual decision-making process, no important information may be excluded, which becomes rather challenging because a considerable fraction of the information in the medical records consists of unstructured text data. In addition, the underwriting process is extremely sensitive to mistakes where insurances are unnecessarily approved despite an assessed enhanced risk of future claims.

Three algorithms, Logistic Regression, XGBoost and a Deep Learning model, were evaluated on training data consisting of the medical records’ structured data from categorical and numerical answers, the text data as TF-IDF observation vectors, and a combination of both subsets of features. XGBoost was the classifier performing best according to the key metric, a pAUC over an FPR from 0 to 0.03.

There is no question about the substantial importance of not disregarding any type of information from the medical records when developing machine learning classifiers to predict the medical underwriting outcomes. With a very risk-conservative and performance-pessimistic approach, the best performing classifier managed, when considering only the group of youngest kids (50% of the sample), to recall close to 50% of all standard-risk applications at a false positive rate of 2%, when both structured and text data were considered. Even though the structured data accounts for most of the explanatory ability, it becomes clear that the inclusion of the text data as TF-IDF observation vectors makes the difference needed to potentially generate a positive net present value for an implementation of the model.
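A minimal sketch of the modeling setup described above, assuming hypothetical column names and an input file: it combines structured answers with TF-IDF vectors of the free text, trains an XGBoost classifier, and reports the partial AUC over an FPR range of 0 to 0.03.

```python
# Illustrative sketch (not the thesis's code): combine structured answers with
# TF-IDF vectors of the free-text fields and evaluate an XGBoost classifier
# with a partial AUC restricted to low false-positive rates.
# The file name and column names ("age", "num_conditions", "answers_text",
# "approved") are hypothetical placeholders.
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("medical_records.csv")
X_struct = csr_matrix(df[["age", "num_conditions"]].to_numpy(dtype=float))
X_text = TfidfVectorizer(max_features=5000).fit_transform(df["answers_text"])
X = hstack([X_struct, X_text]).tocsr()          # structured + text features
y = df["approved"].to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=4)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
# Partial AUC over FPR in [0, 0.03], the key metric mentioned above.
print("pAUC (max_fpr=0.03):", roc_auc_score(y_te, scores, max_fpr=0.03))
```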
|
19 |
Natural Language Processing and Extracting Information From Medical Reports. Pfeiffer II, Richard D. 29 June 2006 (has links)
Submitted to the Health Informatics Graduate Program Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Master of Science in Health Informatics. May 2006 / The purpose of this study is to examine the current use of natural language processing for extracting meaningful data from free text in medical reports. Natural language processing has been used to process information from various genres. To evaluate its use, a synthesized review of primary research papers specific to natural language processing and extracting data from medical reports was conducted. A three-phase approach is used to describe the process of gathering the final metrics for validating the use of natural language processing.
The main purpose of any NLP system is to extract or understand human language and to process it into meaning for a specified area of interest or end user. There are three types of approaches: symbolic, statistical, and connectionist. There are identified problems with natural language processing and the different approaches. Problems noted about natural language processing in the research are: acquisition, coverage, robustness, and extensibility.
Metrics were gathered from primary research papers to evaluate the success of the natural language processors. The average recall across the four papers was 85%. The average precision across five papers was 87.7%. The average accuracy was 97%. Average sensitivity was 84%, while specificity was 97.4%. Based on the results of the primary research, there was no definitive way to validate one NLP approach as an industry standard.
From the research reviewed, it is clear that there has been at least limited success with information extraction from free text using natural language processing. It is important to understand the continuum of data, information, and knowledge in previous and future research on natural language processing. In the field of health informatics, this is a technology necessary for improving healthcare and research.
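For reference, the reported metrics can all be derived from confusion-matrix counts; the short sketch below uses made-up counts, not figures from the reviewed papers.

```python
# Minimal sketch: the reported metrics derived from confusion-matrix counts.
# The counts below are made up for illustration only.
tp, fp, fn, tn = 85, 12, 15, 388

precision   = tp / (tp + fp)            # of the extracted items, how many were correct
recall      = tp / (tp + fn)            # sensitivity: how many true items were found
specificity = tn / (tn + fp)            # how many true negatives were kept out
accuracy    = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```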
|
20 |
A Study of Transformer Models for Emotion Classification in Informal Text. Esperanca, Alvaro Soares de Boa. 12 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Textual emotion classification is a task in affective AI that branches from sentiment analysis and focuses on identifying emotions expressed in a given text excerpt.
It has a wide variety of applications that improve human-computer interactions, particularly to empower computers to understand subjective human language better.
Significant research has been done on this task, but very little of that research leverages one of the most emotion-bearing symbols we have used in modern communication: Emojis.
In this thesis, we propose several transformer-based models for emotion classification that process emojis as input tokens and leverage pretrained models. Among them is ReferEmo, a model that processes emojis as textual inputs and leverages DeepMoji to generate affective feature vectors used as a reference when aggregating different modalities of text encoding.
To evaluate ReferEmo, we experimented on the SemEval 2018 and GoEmotions datasets, two benchmark datasets for emotion classification, and achieved competitive performance compared to state-of-the-art models tested on these datasets. Notably, our model performs better on the underrepresented classes of each dataset.
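One common way to treat emojis as input tokens for a pretrained transformer, shown here as a hedged sketch with the Hugging Face transformers library rather than the ReferEmo implementation itself, is to add the emoji characters to the tokenizer vocabulary and resize the model’s embedding matrix.

```python
# Sketch: make a pretrained transformer treat emojis as first-class tokens by
# adding them to the tokenizer and resizing the embedding matrix. The model
# name, emoji list and label count are illustrative choices, not the thesis setup.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=11)       # e.g. the SemEval-2018 emotion labels

emojis = ["😀", "😢", "😡", "❤️", "😱"]
tokenizer.add_tokens(emojis)                  # new tokens get fresh embeddings
model.resize_token_embeddings(len(tokenizer))

batch = tokenizer(["I can't believe it 😱", "love this ❤️"],
                  padding=True, return_tensors="pt")
logits = model(**batch).logits                # shape: [batch, num_labels]
```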
|