Global ETD Search

621	Generating Paraphrases with Greater Variation Using Syntactic Phrases Madsen, Rebecca Diane 01 December 2006 (has links) (PDF) Given a sentence, a paraphrase generation system produces a sentence that says the same thing but usually in a different way. The paraphrase generation problem can be formulated in the machine translation paradigm; instead of translation of English to a foreign language, the system translates an English sentence (for example) to another English sentence. Quirk et al. (2004) demonstrated this approach to generate almost 90% acceptable paraphrases. However, most of the sentences had little variation from the original input sentence. Leveraging syntactic information, this thesis project presents an approach that successfully generated more varied paraphrase sentences than the approach of Quirk et al. while maintaining coverage of the proportion of acceptable paraphrases generated. The ParaMeTer system (Paraphrasing by MT) identifies syntactic chunks in paraphrase sentences and substitutes labels for those chunks. This enables the system to generalize movements that are more syntactically plausible, as syntactic chunks generally capture sets of words that can change order in the sentence without losing grammaticality. ParaMeTer then uses statistical phrase-based MT techniques to learn alignments for the words and chunk labels alike. The baseline system followed the same pattern as the Quirk et al. system - a statistical phrase-based MT system. Human judgments showed that the syntactic approach and baseline both achieve approximately the same ratio of fluent, acceptable paraphrase sentences per fluent sentences. These judgments also showed that the ParaMeTer system has more phrase rearrangement than the baseline system. Though the baseline has more within-phrase alteration, future modifications such as a chunk-only translation model should improve ParaMeTer's variation for phrase alteration as well. 
paraphrase generation paraphrase sentential paraphrase syntax statistical machine translation machine translation natural language processing Computer Sciences
622	Extending the Information Partition Function: Modeling Interaction Effects in Highly Multivariate, Discrete Data Cannon, Paul C. 28 December 2007 (has links) (PDF) Because of the huge amounts of data made available by the technology boom in the late twentieth century, new methods are required to turn data into usable information. Much of this data is categorical in nature, which makes estimation difficult in highly multivariate settings. In this thesis we review various multivariate statistical methods, discuss various statistical methods of natural language processing (NLP), and discuss a general class of models described by Erosheva (2002) called generalized mixed membership models. We then propose extensions of the information partition function (IPF) derived by Engler (2002), Oliphant (2003), and Tolley (2006) that will allow modeling of discrete, highly multivariate data in linear models. We report results of the modified IPF model on the World Health Organization's Survey on Global Aging (SAGE). Information Partition Function interaction effects multivariate analysis discrete data Natural Language Processing Statistics and Probability
623	Traumatic Brain Injury Surveillance and Research with Electronic Health Records: Building New Capacities McFarlane, Timothy D. 03 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Between 3.2 and 5.3 million U.S. civilians live with traumatic brain injury (TBI)-related disabilities. Although the post-acute phase of TBI has been recognized as both a discrete disease process and risk factor for chronic conditions, TBI is not recognized as a chronic disease. TBI epidemiology draws upon untimely, incomplete, cross-sectional, administrative datasets. The adoption of electronic health records (EHR) may supplement traditional datasets for public health surveillance and research. Methods Indiana constructed a state-wide clinical TBI registry from longitudinal (2004-2018) EHRs. This dissertation includes three distinct studies to enhance, evaluate, and apply the registry: 1) development and evaluation of a natural language processing algorithm for identification of TBI severity within free-text notes; 2) evaluation and comparison of the performance of the ICD-9-CM and ICD-10-CM surveillance definitions; and 3) estimation of the effect of mild TBI (mTBI) on the risk of post-acute chronic conditions compared to individuals without mTBI. Results Automated extraction of Glasgow Coma Scale from clinical notes was feasible and demonstrated balanced recall and precision (F-scores) for classification of mild (99.8%), moderate (100%), and severe (99.9%) TBI. We observed poor sensitivity for ICD-10-CM TBI surveillance compared to ICD-9-CM (0.212 and 0.601, respectively), resulting in potentially 5-fold underreporting. ICD-10-CM was not statistically equivalent to ICD-9-CM for sensitivity (d̂=0.389, 95% CI [0.388,0.405]) or positive predictive value (d̂=-0.353, 95% CI [-0.362,-0.344]). 
Compared to a matched cohort, individuals with mTBI were more likely to be diagnosed with mental health, substance use, neurological, cardiovascular, and endocrine conditions. Conclusion ICD-9-CM and ICD-10-CM surveillance definitions were not equivalent, and the transition resulted in underreporting of mTBI incidence. This has direct implications for existing and future TBI registries and the Report to Congress on Traumatic Brain Injury in the United States. The supplementation of state-based trauma registries with structured and unstructured EHR data is effective for studying TBI outcomes. Our findings support the classification of TBI as a chronic disease by funding bodies, which may increase the public funding available to replace legacy systems and thereby improve the standardization, timeliness, and completeness of data on the epidemiology and post-acute outcomes of TBI. Chronic disease Epidemiology Natural language processing Public health informatics Public health surveillance Traumatic brain injury
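The sensitivity figures in this abstract can be related to each other with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and reading the underreporting factor as the reciprocal of sensitivity is an assumption, not necessarily the method used in the dissertation.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of true cases that the surveillance definition captures."""
    return true_pos / (true_pos + false_neg)


def underreporting_factor(sens: float) -> float:
    """Assumed reading: true cases per captured case, i.e. 1 / sensitivity."""
    return 1.0 / sens


# Sensitivities as reported in the abstract.
sens_icd9 = 0.601
sens_icd10 = 0.212

# Difference in sensitivity between the two definitions.
difference = sens_icd9 - sens_icd10

# Roughly how many true cases exist for each case ICD-10-CM captures.
factor_icd10 = underreporting_factor(sens_icd10)

print(round(difference, 3), round(factor_icd10, 2))
```

Under this reading, a sensitivity of 0.212 corresponds to a little under five true cases per captured case, which is in line with the "potentially 5-fold" wording.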
624	Generation of Control Logic from Ordinary Speech Haghjo, Hamed, Vahlberg, Elias January 2022 (has links) Developments in automatic code generation are evolving remarkably fast, with companies and researchers competing to reach human-level accuracy and capability. Advancements in this field primarily focus on using machine learning models for end-to-end code generation. This project introduces the system CodeFromVoice, which explores an alternative method for code generation. This method relies on existing Natural Language Processing models combined with traditional parsing methods. CodeFromVoice shows that this approach can generate code from text or transcribed speech using Automatic Speech Recognition. The generated code is limited in complexity and restricted to the context of an existing application but achieves a Word Error Rate of less than 25%. / Utvecklingen av automatisk kodgenerering visar stora framsteg, med företag och forskare som tävlar om att nå mänsklig nivå av noggrannhet och förmåga. Framsteg inom detta område fokuserar främst på användning av maskininlärningsmodeller för hela kodgenerering processen. Detta projekt introducerar systemet CodeFromVoice, som utforskar en alternativ metod för kodgenerering. Denna metod bygger på befintliga NLP-modeller kombinerat med traditionella parsning metoder. CodeFromVoice visar att detta tillvägagångssätt kan generera kod från text eller transkriberat tal med automatisk taligenkänning. Den genererade koden är begränsad i komplexitet och begränsad till sammanhanget av en existerande applikation, men uppnår en ordfelfrekvens som är mindre än 25%. Code generation generation of code generation of control logic natural language processing Engineering and Technology Teknik och teknologier
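Word Error Rate, the metric quoted in this abstract, is commonly defined as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and a hypothetical function name; the abstract does not say how the project tokenized its output or whether the metric was applied to the transcript or to the generated code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("red" -> "read").
print(word_error_rate("set the light to red", "set light to read"))
```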
625	Topic discovery and document similarity via pre-trained word embeddings Chen, Simin January 2018 (has links) Throughout history, humans have generated an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity. In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power. Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it allows for topic discovery of the document. On three labeled public data sets, our model achieved k-nearest neighbor classification accuracy comparable with that of five state-of-the-art baseline models. Furthermore, from these three data sets, we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results provide answers to the research question of this thesis: Can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate the document similarity? 
/ Under hela historien fortsätter människor att skapa en växande mängd dokument om ett brett spektrum av publikationer. Vi förlitar oss nu på dataprogram för att automatiskt bearbeta dessa stora samlingar av dokument i olika applikationer. Många applikationer kräver en kvantitativmått av dokumentets likhet. Traditionella metoder först lära en vektorrepresentation för varje dokument med hjälp av en stor corpus och beräkna sedan avståndet mellan two document vektorer som dokumentets likhet.Till skillnad från detta corpusbaserade tillvägagångssätt, föreslår vi en rak modell som direkt upptäcker ämnena i ett dokument genom att klustra sina ord , utan behov av en corpus. Vi definierar en vektorrepresentation som kallas normalized bag-of-topic-embeddings (nBTE) för att inkapsla de upptäckta ämnena och beräkna den mjuka cosinuslikheten mellan två nBTE-vektorer som dokumentets likhet. Dessutom föreslår vi en logistisk ordbetydelsefunktion som tilldelar ord olika viktvikter baserat på relativ diskriminerande kraft.Vår modell är effektiv när det gäller den genomsnittliga tidskomplexiteten. nBTE-representationen är också tolkbar som möjliggör ämnesidentifiering av dokumentet. På tremärkta offentliga dataset uppnådde vår modell jämförbar närmaste grannklassningsnoggrannhet med fem toppmoderna modeller. Vidare härledde vi från de tre dataseten fyra multi-ämnesdatasatser där varje etikett hänvisar till en uppsättning ämnen. Vår modell överensstämmer överens med de högteknologiska baslinjemodellerna med en stor marginal av fyra utmanande multi-ämnesdatasatser. Dessa arbetsstöd ger svar på forskningsproblemet av tisthesis:Kan vi konstruera en tolkbar dokumentrepresentation genom att klustra orden i ett dokument och effektivt och effektivt uppskatta dokumentets likhet? Computer and Information Sciences Data- och informationsvetenskap
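Soft cosine similarity, which this abstract uses to compare nBTE vectors, generalizes cosine similarity by weighting every pair of dimensions with a similarity matrix. The sketch below shows the standard formula in plain Python. The helper names and the toy 2x2 matrix are illustrative assumptions; the abstract does not describe how the thesis builds its vectors or its similarity matrix.

```python
import math


def bilinear(a, s, b):
    """Compute a^T S b for vectors a, b and a square similarity matrix s."""
    return sum(a[i] * s[i][j] * b[j]
               for i in range(len(a))
               for j in range(len(b)))


def soft_cosine(a, b, s):
    """Soft cosine: a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))."""
    denom = math.sqrt(bilinear(a, s, a)) * math.sqrt(bilinear(b, s, b))
    return bilinear(a, s, b) / denom if denom else 0.0


# Two vectors with no shared dimension.
doc_a = [1.0, 0.0]
doc_b = [0.0, 1.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # reduces to ordinary cosine
related = [[1.0, 0.5], [0.5, 1.0]]    # the two dimensions are half-similar

print(soft_cosine(doc_a, doc_b, identity))  # 0.0
print(soft_cosine(doc_a, doc_b, related))   # 0.5
```

With the identity matrix the measure reduces to ordinary cosine similarity, so two vectors with no shared dimension score zero. A matrix with off-diagonal entries lets related dimensions contribute to the score.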
626	Automatic Reference Resolution for Pedestrian Wayfinding Systems / Automatisk referenslösning i navigationssystem för fotgängare Kalpakchi, Dmytro January 2018 (has links) Imagine that you are in a new city and want to explore it. Trying to navigate with maps leads to unnecessary confusion about street names and prevents you from enjoying a wonderful walk. A dialogue system that could guide you by means of a simple conversation, using salient landmarks in your immediate vicinity, would be much more helpful! Developing such a dialogue system is non-trivial and requires solving many complicated tasks. One such task, tackled in the present thesis, is called reference resolution (RR), i.e. resolving utterances to the underlying geographical entities, the referents (if any). The utterances that have referent(s) are called referring expressions (REs). The RR task is decomposed into two tasks: RE identification and resolution itself. Neural network models for both tasks have been designed and extensively evaluated. The model for RE identification, called RefNet, utilizes recurrent neural networks (RNNs) for handling sequential input, i.e. phrases. For each word in an utterance, RefNet outputs a label indicating whether the word is at the beginning of an RE, inside it, or outside it. The reference resolution model, called SpaceRefNet, uses RefNet's RNN layer to encode REs and a purpose-built feature extractor to represent geographical objects. Both encodings are fed to a simple feed-forward network with a softmax prediction layer, yielding the probability of a match between the RE and the geographical object. Both models beat their respective baselines and show promising results in general. / Tänk dig att du är i en ny stad och vill känna staden bättre. Du försöker att använda kartor, men blir förvirrad av gatunamn och kan inte njuta av din promenad. 
Ett dialogsystem, som kan hjälpa dig att navigera med hjälp av talade instruktioner, och som använder sig av framträdande landmärken i din närhet skulle vara mer användbart! Att utveckla ett sådant system är mycket komplicerat och man behöver att lösa ett antal mycket svåra uppgifter. En av dessa uppgifter kallas referenslösning (RR), vilket innebär att associera refererande fraser (RE) i yttranden till de geografiska objekt som avses. RR har brutits ner i två deluppgifter: identifiering av RE i yttranden, och referenslösning av dessa RE. Neurala-nätverksmodeller har utformats och utvärderats för båda uppgifterna. Modellen för identifiering av RE kallas RefNet och använder återkopplande neuronnät (RNN) för att behandla sekventiellindata, d.v.s. fraser. Varje ord i ett yttrande klassificeras av RefNet som en av tre följande kategorier: “i början av RE”, “i mitten av RE” samt “utanför RE”. Modellen för RR kallas SpaceRefNet och använder RefNets RNN-lager för att representera RE, samt en designad särdragsextraktor för att koda geografiska objekt. Båda kodningarna används som indata för ett enkelt framåtmatande neuronnät med ett avslutande softmax-lager. Det avslutande lagret producerar en sannolikhet att en viss RE motsvarar det geografiska objektet i fråga. Båda modellerna fungerade bättre än respektive baslinjemodeller, och visar lovande resultat i allmänhet. / Уявiть, що Ви опинилися у мiстi, яке нiколи не вiдвiдували. Ви хочете побачити все, що мiсто може Вам запропонувати, але не знаєте нiкого, хто може з цим допомогти. Назви вулиць на електронних картах не тiльки не допомагають, а ще й заплутують Вас, заважаючи отримувати насолоду вiд чудової прогулянки. Було б набагато зручнiше, якщо Ви могли б говорити з дiалоговою системою, як Ви говорите з друзями. Така система допомагала б Вам орiєнтуватися, використовуючи помiтнi орiєнтири у Вашому оточеннi. 
Розробка такої системи включає в себе багато нетривiальних задач, одна з яких називається задача розв’язання географiчних посилань (РГП). Словосполучення, вживанi з метою вказати на специфiчний географiчний об’єкт, є досить розповсюдженими у повсякденнiй мовi. Такi словосполучення називаються географiчними посиланнями (ГП), а географiчнi об’єкти, на якi вони посилаються - референтами. Задача розв’язання географiчних посилань полягає у спiвставленнi їх з вiдповiдними референтами.У рамках даної дипломної роботи задача РГП була декомпозована на двi частини: iдентифiкацiя географiчних посилань (IГП) та власне розв’язання (ВРГП). Для вирiшення обох задач було розроблено, протестовано та оцiнено вiдповiднi нейроннi мережi. Модель для розв’язання задачi IГП називається RefNet та використовує рекурентнi нейроннi мережi, щоб мати змогу обробляти послiдовнi вхiднi данi, як-то фрази. RefNet аналiзує висловлене речення дослiвно та визначає для кожного слова чи воно знаходиться на початку, всерединi чи поза ГП. Модель для розв’язання задачi ВРГП називається SpaceRefNet та використовує рекурентний шар RefNet для представлення поданих на вхiд ГП. Географiчнi об’єкти представляються за допомогою розробленого алгоритму видiляння ознак. Обидва представлення подаються на вхiд простої нейронної мережi прямого поширення з кiнцевим шаром softmax, який обчислює ймовiрнiсть того, що подане ГП описує поданий географiчний об’єкт.Обидвi мережi показали гарний результат, кращий за вiдповiднi базовi моделi. Результати загалом показують, що використання нейронних мереж для вирiшення задачi розв’язання географiчних посилань – це перспективний напрям для майбутнiх дослiджень. Reference Resolution Pedestrian Wayfinding Systems Natural Language Processing Computer Sciences Datavetenskap (datalogi)
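The per-word labeling that RefNet is described as producing (beginning of an RE, inside it, or outside it) can be turned back into referring expressions with a small decoding step. The sketch below assumes the labels are the strings "B", "I" and "O" and uses a hypothetical function name. It illustrates the labeling scheme only, not the neural model itself.

```python
def extract_referring_expressions(tokens, labels):
    """Group tokens into referring expressions from B/I/O labels."""
    spans = []
    current = []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new expression starts; close any expression still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            # Continue the expression that is currently open.
            current.append(token)
        else:
            # "O", or a stray "I" with no open expression: close and reset.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = "walk past the red church and turn left".split()
labels = ["O", "O", "B", "I", "I", "O", "O", "O"]
print(extract_referring_expressions(tokens, labels))  # ['the red church']
```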
627	Keyword Extraction from Swedish Court Documents / Extraktion av nyckelord från svenska rättsdokument Grosz, Sandra January 2020 (has links) This thesis addresses the problem of extracting keywords that represent the rulings and the grounds for the rulings in Swedish court documents. The problem of identifying the candidate keywords was divided into two steps: first preprocessing the documents, and then extracting keywords by applying a keyword extraction algorithm to the preprocessed documents. The preprocessing methods used in conjunction with the keyword extraction algorithms were stop-word removal and stemming. Then, three different approaches for extracting keywords were used: one statistical approach, one machine learning approach, and one graph-based approach. The three approaches were then evaluated to measure the quality of the keywords and the rejection rate of keywords that were not of high enough quality. Of the three approaches implemented and evaluated, the results indicated that the graph-based approach showed the most promise. However, the results also showed that none of the three approaches was accurate enough to be used without human supervision. / Detta examensarbete behandlar problemet om att extrahera nyckelord som representerar domslut och domskäl ur svenska rättsdokument. Problemet med att identifiera möjliga nyckelord delades upp i två steg; det första steget är att använda förbehandlingsmetoder och det andra steget att extrahera nyckelord genom att använda en algoritm för nyckelordsextraktion. Förbehandlingsmetoderna som användes tillsammans med nyckelordsextraktionsalgoritmerna var stoppord samt avstammare. Sedan användes tre olika metoder för att extrahera nyckelord; en statistisk, en maskininlärningsbaserad och slutligen en grafbaserad. 
De tre metoderna för att extrahera nyckelord blev sedan evaluerade för att kunna mäta kvaliteten på nyckelorden samt i vilken grad nyckelord som inte var av tillräckligt hög kvalitet förkastades. Av de tre implementerade och evaluerade tillvägagångssätten visade den grafbaserade metoden mest lovande resultat. Däremot visade resultaten även att ingen av de tre metoderna hade en tillräckligt hög riktighet för att kunna användas utan mänsklig övervakning. Keywords extraction Information Retrieval Natural Language Processing nyckelordsextraktion informationssökning naturligt språkbehandling. Computer and Information Sciences Data- och informationsvetenskap
628	Exploring the Potential of Twitter Data and Natural Language Processing Techniques to Understand the Usage of Parks in Stockholm / Utforska potentialen för användning av Natural Language Processing på Twitter data för att förstå användningen av parker i Stockholm Norsten, Theodor January 2020 (has links) Traditional methods for investigating the usage of parks rely on questionnaires, which are both time- and resource-consuming. Today more than four billion people use some form of social media platform daily. As a result, huge amounts of data are generated every day through various social media platforms, creating a potential new source for retrieving large amounts of data. This report investigates a modern approach, using Natural Language Processing on Twitter data to understand how parks in Stockholm are being used. Natural Language Processing (NLP) is an area within artificial intelligence that refers to the process of reading, analyzing, and understanding large amounts of text data, and it is considered to be the future for understanding unstructured text. Twitter data were obtained through Twitter's open API. Data from three parks in Stockholm were collected for the period 2015-2019. Three analyses were then performed: temporal, sentiment, and topic modeling. The results from these analyses show that it is possible to understand what attitudes and activities are associated with visiting parks using NLP on social media data. It is clear that sentiment analysis is a difficult task for computers to solve and that it is still in an early stage of development. The results from the sentiment analysis indicate some uncertainties. To achieve more reliable results, the analysis would need to include much more data, use more thorough cleaning methods, and be based on English tweets. 
One significant conclusion from the results is that people’s attitudes and activities linked to each park are clearly correlated with the attributes of that park. Another clear pattern is that the usage of parks peaks significantly during holiday celebrations and that positive sentiment is the emotion most strongly linked with park visits. The findings suggest that future studies should focus on combining the approach in this report with geospatial data from a social media platform where users share their geolocation to a greater extent. / Traditionella metoder använda för att förstå hur människor använder parker består av frågeformulär, en mycket tids -och- resurskrävande metod. Idag använder mer en fyra miljarder människor någon form av social medieplattform dagligen. Det har inneburit att enorma datamängder genereras dagligen via olika sociala media plattformar och har skapat potential för en ny källa att erhålla stora mängder data. Denna undersöker ett modernt tillvägagångssätt, genom användandet av Natural Language Processing av Twitter data för att förstå hur parker i Stockholm används. Natural Language Processing (NLP) är ett område inom artificiell intelligens och syftar till processen att läsa, analysera och förstå stora mängder textdata och anses vara framtiden för att förstå ostrukturerad text. Data från Twitter inhämtades via Twitters öppna API. Data från tre parker i Stockholm erhölls mellan perioden 2015–2019. Tre analyser genomfördes därefter, temporal, sentiment och topic modeling. Resultaten från ovanstående analyser visar att det är möjligt att förstå vilka attityder och aktiviteter som är associerade med att besöka parker genom användandet av NLP baserat på data från sociala medier. Det är tydligt att sentiment analys är ett svårt problem för datorer att lösa och är fortfarande i ett tidigt skede i utvecklingen. Resultaten från sentiment analysen indikerar några osäkerheter. 
För att uppnå mer tillförlitliga resultat skulle analysen bestått av mycket mer data, mer exakta metoder för data rensning samt baserats på tweets skrivna på engelska. En tydlig slutsats från resultaten är att människors attityder och aktiviteter kopplade till varje park är tydligt korrelerat med de olika attributen respektive park består av. Ytterligare ett tydligt mönster är att användandet av parker är som högst under högtider och att positiva känslor är starkast kopplat till park-besök. Resultaten föreslår att framtida studier fokuserar på att kombinera metoden i denna rapport med geospatial data baserat på en social medieplattform där användare delar sin platsinfo i större utsträckning. Natural Language Processing Sentiment analysis Topic modeling Twitter VADER LDA Other Social Sciences Annan samhällsvetenskap
629	Using semantic folding with TextRank for automatic summarization / Semantisk vikning med TextRank för automatisk sammanfattning Karlsson, Simon January 2017 (has links) This master's thesis deals with automatic summarization of text and how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures: cosine similarity of tf-idf vectors and the number of overlapping terms in two sentences. The linguistic features used in constructing the three methods were stop words, part-of-speech filtering, and stemming. Five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives. The three methods were evaluated by summarizing documents from the Document Understanding Conference and comparing the results to gold-standard summaries created by human judges. Comparison between the system summaries and gold-standard summaries was made with the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but only 0.0096 worse in F-score than cosine similarity of tf-idf vectors, which performed best. For semantic folding, the average precision was 46.2% and recall 45.7% for the best-performing part-of-speech filter. / Det här examensarbetet behandlar automatisk textsammanfattning och hur semantisk vikning kan användas som likhetsmått mellan meningar i algoritmen TextRank. Metoden implementerades och jämfördes med två vanliga likhetsmått. Dessa två likhetsmått var cosinus-likhet mellan tf-idf-vektorer samt antal överlappande termer i två meningar. De tre metoderna implementerades och de lingvistiska särdragen som användes vid konstruktionen var stoppord, filtrering av ordklasser samt en avstämmare. Fem olika filter för ordklasser användes, med olika blandningar av substantiv, verb och adjektiv. 
De tre metoderna utvärderades genom att sammanfatta dokument från DUC och jämföra dessa mot guldsammanfattningar skapade av mänskliga domare. Jämförelse mellan systemsammanfattningar och guldsammanfattningar gjordes med måttet ROUGE-1. Algoritmen med semantisk vikning presterade sämst av de tre jämförda metoderna, dock bara 0.0096 sämre i F-score än cosinus-likhet mellan tf-idf-vektorer som presterade bäst. För semantisk vikning var den genomsnittliga precisionen 46.2% och recall 45.7% för det ordklassfiltret som presterade bäst. automatic summarization natural language processing TextRank semantic folding extractive document summarization Computer Sciences Datavetenskap (datalogi)
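ROUGE-1, the measure used in this evaluation, scores a system summary by its unigram overlap with a reference summary. The sketch below is a simplified version that assumes lowercasing and whitespace tokenization and uses a hypothetical function name. The official ROUGE toolkit offers further options, such as stemming and stop-word removal, and the abstract does not say which settings the thesis used.

```python
from collections import Counter


def rouge_1(system: str, reference: str):
    """Return (precision, recall, F1) from clipped unigram overlap."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four of six unigrams overlap in each direction.
p, r, f = rouge_1("the cat sat on the mat", "the cat lay on a mat")
print(round(p, 3), round(r, 3), round(f, 3))
```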
630	Deep Text Mining of Instagram Data Without Strong Supervision / Textutvinning från Instagram utan Precis Övervakning Hammar, Kim January 2018 (has links) With the advent of social media, our online feeds increasingly consist of short, informal, and unstructured text. This data can be analyzed for the purpose of improving user recommendations and detecting trends. The large volume of unstructured text that is available makes the intersection of text processing and machine learning a promising avenue of research. Current methods that use machine learning for text processing are in many cases dependent on annotated training data. However, considering the heterogeneity and variability of social media, obtaining strong supervision for social media data is in practice both difficult and expensive. In light of this limitation, a belief that has shaped this thesis is that the study of text mining methods that can be applied without strong supervision is of higher practical interest. This thesis investigates unsupervised methods for scalable processing of text from social media. In particular, the thesis targets a classification and extraction task in the fashion domain on the image-sharing platform Instagram. Instagram is one of the largest social media platforms, containing both text and images. Still, research on text processing in social media is to a large extent limited to Twitter data, and little attention has been paid to text mining of Instagram data. The aim of this thesis is to broaden the scope of state-of-the-art methods for information extraction and text classification to the unsupervised setting, working with informal text on Instagram. 
Its main contributions are (1) an empirical study of text from Instagram; (2) an evaluation of word embeddings for Instagram text; (3) a distributed implementation of the FastText algorithm; (4) a system for fashion attribute extraction in Instagram using word embeddings; and (5) a multi-label clothing classifier for Instagram text, built with deep learning techniques and minimal supervision. The empirical study demonstrates that the text distribution on Instagram exhibits the long-tail phenomenon, that the text is just as noisy as has been reported in studies on Twitter text, and that comment sections are multi-lingual. Experiments with word embeddings for Instagram demonstrate the importance of hyperparameter tuning and reveal a mismatch between pre-trained embeddings and social media text. Furthermore, the experiments confirm that word embeddings are a useful asset for information extraction. Experimental results show that word embeddings beat a baseline that uses Levenshtein distance on the task of extracting fashion attributes from Instagram. The results also show that the distributed implementation of FastText reduces the time it takes to train word embeddings by a factor that scales with the number of machines used for training. Finally, our research demonstrates that weak supervision can be used to train a deep classifier, achieving an F1 score of 0.61 on the task of classifying clothes in Instagram posts based only on the associated text, which is on par with human performance. / I och med uppkomsten av sociala medier så består våra online-flöden till stor del av korta och informella textmeddelanden, denna data kan analyseras med syftet att upptäcka trender och ge användarrekommendationer. Med tanke på den stora volymen av ostrukturerad text som finns tillgänglig så är kombinationen av språkteknologi och maskinlärning ett forskningsområde med stor potential. 
Nuvarande maskinlärningsteknologier för textbearbetning är i många fall beroende av annoterad data för träning. I praktiken så är det dock både komplicerat och dyrt att anskaffa annoterad data av hög kvalitet, inte minst vad gäller data från sociala medier, med tanke på hur pass föränderlig och heterogen sociala medier är som datakälla. En övertygelse som genomsyrar denna avhandling är att textutvinnings metoder som inte kräver precis övervakning har större potential i praktiken. Denna avhandling undersöker oövervakade metoder för skalbar bearbetning av text från sociala medier. Specifikt så täcker avhandlingen ett komplext klassifikations- och extraktions- problem inom modebranschen på bilddelningsplattformen Instagram. Instagram hör till de mest populära sociala plattformarna och innehåller både bilder och text. Trots det så är forskning inom textutvinning från sociala medier till stor del begränsad till data från Twitter och inte mycket uppmärksamhet har givits de stora möjligheterna med textutvinning från Instagram. Ändamålet med avhandlingen är att förbättra nuvarande metoder som används inom textklassificering och informationsextraktion, samt göra dem applicerbara för oövervakad maskinlärning på informell text från Instagram. De primära forskningsbidragen i denna avhandling är (1) en empirisk studie av text från Instagram; (2) en utvärdering av ord-vektorer för användning med text från Instagram; (3) en distribuerad implementation av FastText algoritmen; (4) ett system för extraktion av kläddetaljer från Instagram som använder ord-vektorer; och (5) en flerkategorisk kläd-klassificerare för text från Instagram, utvecklad med djupinlärning och minimal övervakning. Den empiriska studien visar att textdistributionen på Instagram har en lång svans, att texten är lika informell som tidigare rapporterats från studier på Twitter, samt att kommentarssektionerna är flerspråkiga. 
Experiment med ord-vektorer för Instagram understryker vikten av att justera parametrar före träningsprocessen, istället för att använda förbestämda värden. Dessutom visas att ord-vektorer tränade på formell text är missanpassade för applikationer som bearbetar informell text. Vidare så påvisas att ord-vektorer är effektivt för informationsextraktion i sociala medier, överlägsen ett standardvärde framtaget med informationsextraktion baserat på syntaktiskt ordlikhet. Resultaten visar även att den distribuerade implementationen av FastText kan minska tiden det tar att träna ord-vektorer med en faktor som beror på antalet maskiner som används i träningen. Slutligen, vår forskning indikerar att svag övervakning kan användas för att träna en klassificerare med djupinlärning. Den tränade klassificeraren uppnår ett F1 resultat av 0.61 på uppgiften att klassificera kläddetaljer av bilder från Instagram, baserat endast på bildtexten och tillhörande användarkommentarer, vilket är i nivå med mänsklig förmåga. Natural Language Processing Information Extraction Machine Learning Språkteknologi Informationsextraktion Maskinlärning Computer Systems Datorsystem
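The baseline mentioned in this abstract relies on Levenshtein distance, i.e. the number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below shows the distance together with a hypothetical matcher, `closest_term`, that maps a noisy token to a vocabulary term. The vocabulary, the threshold and the matching rule are illustrative assumptions, not the thesis's actual baseline.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def closest_term(token, vocabulary, max_distance=2):
    """Hypothetical matcher: nearest vocabulary term within max_distance."""
    best = min(vocabulary, key=lambda term: levenshtein(token, term))
    return best if levenshtein(token, best) <= max_distance else None


vocabulary = ["jeans", "jacket", "dress"]
print(levenshtein("kitten", "sitting"))        # 3
print(closest_term("jeanz", vocabulary))       # jeans
print(closest_term("sunglasses", vocabulary))  # None
```

A purely string-based matcher of this kind can only link spelling variants. It cannot relate words that are similar in meaning but spelled differently, which is the kind of gap word embeddings are meant to cover.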
