11

Um método para extração de palavras-chave de documentos representados em grafos / A method for keyword extraction from documents represented as graphs

Abilhoa, Willyan Daniel 05 February 2014 (has links)
Twitter is a microblog service that generates a huge amount of textual content daily. All this content needs to be explored by means of techniques such as text mining, natural language processing and information retrieval. In this context, automatic keyword extraction is a task of great usefulness that can be applied to indexing, summarization and knowledge extraction from texts. A fundamental step in text mining consists of building a text representation model. The vector space model (VSM) is the best-known and most widely used of these representations. However, some difficulties and limitations of the VSM, such as scalability and sparsity, motivate the proposal of alternative approaches. This dissertation proposes a keyword extraction method for tweet collections, called TKG (Twitter Keyword Graph), which represents texts as graphs and applies centrality measures to find the relevant vertices (keywords). To assess the performance of the proposed approach, two different sets of experiments are performed and comparisons with TF-IDF and KEA are made, with human classifications as benchmarks. The experiments showed that some variations of TKG are consistently superior to others and to the algorithms used for comparison. (Funded by Fundação de Amparo a Pesquisa do Estado de São Paulo.)
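
A rough sketch of the graph-and-centrality idea this abstract describes, not the thesis's actual TKG implementation; the adjacent-token co-occurrence window and the choice of degree centrality are assumptions made for illustration:

```python
# Minimal sketch of graph-based keyword extraction: build a word
# co-occurrence graph from tweets, then rank vertices by centrality.
# Assumptions: adjacent tokens co-occur; degree centrality ranks words.
import networkx as nx

def extract_keywords(tweets, top_k=5):
    graph = nx.Graph()
    for tweet in tweets:
        tokens = tweet.lower().split()
        for a, b in zip(tokens, tokens[1:]):  # window of two tokens
            if a != b:
                graph.add_edge(a, b)
    # Other measures (closeness, eccentricity) could be swapped in here.
    centrality = nx.degree_centrality(graph)
    return sorted(centrality, key=centrality.get, reverse=True)[:top_k]

tweets = ["keyword extraction from tweet graphs",
          "graphs represent tweet text for keyword extraction"]
print(extract_keywords(tweets))
```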
12

Extrakce klíčových slov z dokumentů / Keyword Extraction from Documents

Matička, Jiří January 2012 (has links)
This thesis pursues automated extraction of keywords from documents. Its goal is to design and implement an application able to extract a set of keywords appropriate to the contents of a document, with speed and accuracy as the major requirements. The first part of the thesis therefore surveys existing approaches and classifies them in detail according to various criteria. The second part focuses on choosing one of the methods for extracting keywords and describing its operation thoroughly. The following parts contain a detailed design of the application and its implementation. The final chapter is particularly important, as it tests the application on a collection of text documents and evaluates the results of the extraction process.
13

Data Fusion and Text Mining for Supporting Journalistic Work

Zsombor, Vermes January 2022 (has links)
During the past several decades, journalists have been struggling with the ever-growing amount of data on the internet. Investigating the validity of sources or finding similar articles for a story can consume a lot of time and effort, and these issues are amplified by the shrinking staffs of news agencies. The solution is to empower the remaining professional journalists with digital tools created by computer scientists. This thesis project is inspired by the idea of supporting journalistic work with interactive visual interfaces and artificial intelligence. More specifically, within the scope of this project we created a backend module that supports several text mining methods, such as keyword extraction, named entity recognition, sentiment analysis and fake news classification, as well as data collection from various sources, to help professionals in the field of journalism. To implement our system, we first gathered requirements from several researchers and practitioners in journalism, media studies, and computer science, then reviewed the literature on current approaches. Results are evaluated both with quantitative methods, such as individual component benchmarks, and with qualitative methods, by analyzing the outcomes of semi-structured interviews with collaborating and external domain experts. Our results show that the domain experts' perceived value corresponds to the performance of the components in the individual evaluations. This shows that there is potential in this research area and that future work would be welcomed by the journalistic community.
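
As a hedged sketch of what two of the listed components might look like (the thesis's actual stack and models are not specified in this record; spaCy and its en_core_web_sm model are assumptions used only for illustration):

```python
# Illustrative only: noun-chunk keyword candidates and named entities
# with spaCy. Assumes: pip install spacy &&
# python -m spacy download en_core_web_sm
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def analyze(article_text, top_k=10):
    doc = nlp(article_text)
    # Keyword candidates: the most frequent noun chunks.
    chunks = Counter(c.text.lower() for c in doc.noun_chunks)
    keywords = [c for c, _ in chunks.most_common(top_k)]
    # Named entities: surface form plus label (PERSON, ORG, GPE, ...).
    entities = [(e.text, e.label_) for e in doc.ents]
    return keywords, entities

kws, ents = analyze("Reuters reported that the European Commission "
                    "opened an inquiry into social media platforms.")
print(kws, ents)
```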
14

Text Simplification and Keyphrase Extraction for Swedish

Lindqvist, Ellinor January 2019 (has links)
Attempts have been made in Sweden to increase the readability of texts addressed to the public, and ongoing projects are still being conducted by disability associations, private companies and Swedish authorities. In this thesis project, we explore automatic approaches to increase readability through text simplification and keyphrase extraction, with the goal of facilitating text comprehension and readability for people with reading difficulties. A combination of handwritten rules and monolingual machine translation was used to simplify the syntactic and lexical content of Swedish texts, and noun phrases were extracted to provide the reader with a short summary of the textual content. A user evaluation was conducted to compare the original and the simplified version of the same text. Several texts and their simplified versions were also evaluated using established readability metrics. Although a manual evaluation showed that the implemented rules generally worked as intended on the sentences they targeted, the results from the user evaluation and readability metrics did not show improvements. We believe that further additions to the rule set, targeting a wider range of linguistic structures, have the potential to improve the results.
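
A minimal sketch of rule-based noun phrase extraction of the kind the abstract mentions. The thesis works on Swedish; NLTK's English tagger and this particular chunk grammar are assumptions used only to illustrate the idea:

```python
# Illustrative noun-phrase chunking with one handwritten rule:
# an optional determiner, any adjectives, then one or more nouns.
# Assumes: pip install nltk, plus the punkt and
# averaged_perceptron_tagger data packages.
import nltk

grammar = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def noun_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = grammar.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(noun_phrases("The simplified text gives a short summary "
                   "of the textual content."))
```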
15

A Platform for Aligning Academic Assessments to Industry and Federal Job Postings

Parks, Tyler J. 07 1900 (has links)
The proposed tool provides users with a platform for side-by-side comparison of classroom assessments and job posting requirements. Using techniques and methodologies from NLP, machine learning, data analysis, and data mining, the employed algorithm analyzes job postings and classroom assessments, extracts and classifies the skill units within them, and then compares the sets of skills from the different input volumes. This effectively yields a predicted alignment between academic and career sources, both federal and industrial. The compiled tool results indicate an overall accuracy score of 82% and an alignment score of 75.5% between the input assessments and the job postings overall. In other words, the 50 UNT assessments and 5,000 industry and federal job postings examined demonstrate a compatibility (alignment) of 75.5%, and this measure was calculated using a tool operating at an 82% precision rate.
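
A hedged sketch of how an alignment score between two extracted skill sets might be computed. The abstract does not spell out the tool's formula; Jaccard overlap over plain skill strings is an assumption made for illustration:

```python
# Illustrative alignment between skills extracted from an assessment
# and from a set of job postings. Jaccard similarity is assumed here;
# the actual tool may weight or classify skills differently.
def alignment(assessment_skills, posting_skills):
    a, b = set(assessment_skills), set(posting_skills)
    return len(a & b) / len(a | b) if a | b else 0.0

assessment = {"python", "sql", "data mining", "statistics"}
postings = {"python", "sql", "machine learning", "statistics", "aws"}
print(f"alignment = {alignment(assessment, postings):.1%}")
```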
16

AI Enabled Cloud RAN Test Automation : Automatic Test Case Prediction Using Natural Language Processing and Machine Learning Techniques / AI Cloud RAN test automatisering : Automatisk generering av testfall med hjälp av naturlig språkbehandling och maskininlärningstekniker

Santosh Nimbhorkar, Jeet January 2023 (has links)
The Cloud Radio Access Network (RAN) is a technology used in the telecommunications industry. It provides a flexible, scalable, and cost-effective solution for managing and delivering seamless wireless network services. However, testing Cloud RAN applications poses formidable challenges due to their complex nature, resulting in potential delays in product delivery and increased costs. Test automation is one approach to tackling these challenges: by automating the testing process, we can reduce manual effort, enhance the accuracy and efficiency of testing procedures, and ultimately expedite the delivery of high-quality products. Artificial intelligence (AI) and machine learning (ML) can be used to aid Cloud RAN testing, empowering us to swiftly identify and address complex issues. The goal of this thesis is a data-driven approach to Cloud RAN test automation. Machine learning and natural language processing techniques are used to automatically predict test cases from test instructions. The test instructions are analyzed and keywords are extracted from them using natural language processing; the performance of two keyword extraction techniques is compared, and spaCy was the better-performing keyword extractor. Test script prediction from these keywords is done in two ways, using test script names and using test script contents. Random Forest was the best-performing model for both approaches, both when the data were oversampled and when they were undersampled.
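
A minimal sketch of the kind of keyword-to-test-script prediction pipeline the abstract describes. The thesis's actual features, labels and sampling setup are not given in this record; TF-IDF features over instruction text and scikit-learn's RandomForestClassifier are assumptions:

```python
# Illustrative test-script prediction from test-instruction text.
# Assumes scikit-learn; labels here are hypothetical script names.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

instructions = [
    "verify cell setup on cloud ran node",
    "check handover between two cells",
    "verify cell setup after restart",
    "measure throughput during handover",
]
scripts = ["cell_setup.py", "handover.py", "cell_setup.py", "handover.py"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # instruction text -> weighted features
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(instructions, scripts)
print(model.predict(["verify handover throughput"]))
```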
17

以型態組合為主的關鍵詞擷取技術在學術寫作字彙上的研究 / A pattern approach to keyword extraction for academic writing vocabulary

邵智捷, Shao, Chih Chieh Unknown Date (has links)
With the passage of time, people have come to appreciate the importance of recording knowledge and experience in texts and preserving them for future research. Today, academic papers written in English are the world's leading medium of research communication. Non-native English researchers often struggle with inappropriate vocabularies or collocations that fail to convey their results accurately, or express them poorly, so learning and using correct academic writing vocabulary and collocations in English is very important. In this study, we constructed a corpus of real academic theses covering different countries and fields of research. A keyword extraction technique based on several part-of-speech tag patterns is used to capture candidate academic writing vocabulary from these works as an initial result. The candidates are then passed to an index analysis model, which selects the further, indexically meaningful candidates according to their index characteristics. In the experiments, sample data from different fields are filtered, the cross-field commonality of the vocabulary is analyzed and verified with statistical methods, and an auxiliary filtering mechanism is applied, yielding the nouns and verbs commonly used in academic writing. Based on these vocabularies, we discover common collocation combinations in the corpus and offer them to researchers and students with English as a foreign language as a reference for common vocabulary and collocations in academic writing, in the hope of supporting richer and more correct research writing.
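
A hedged sketch of part-of-speech-pattern candidate extraction like the abstract describes. The thesis's actual tag patterns and index analysis model are not given in this record; the two toy patterns below and the use of NLTK's English tagger are assumptions:

```python
# Illustrative POS-pattern candidate extraction: collect token
# sequences matching adjective(s)+noun(s) or verb+noun(s) patterns.
# Assumes NLTK with punkt and averaged_perceptron_tagger data.
import nltk

# Two toy candidate patterns; the thesis's real pattern set differs.
PATTERNS = nltk.RegexpParser(r"""
    CAND: {<JJ>*<NN.*>+}
          {<VB.*><NN.*>+}
""")

def candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = PATTERNS.parse(tagged)
    return [" ".join(w for w, t in st.leaves())
            for st in tree.subtrees() if st.label() == "CAND"]

print(candidates("We propose a pattern approach to keyword extraction "
                 "for academic writing vocabulary."))
```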
18

Ανάπτυξη μεθόδου με σκοπό την αναγνώριση και εξαγωγή θεματικών λέξεων κλειδιών από διευθύνσεις ιστοσελίδων του ελληνικού Διαδικτύου / Keyword identification within Greek URLs

Βονιτσάνου, Μαρία-Αλεξάνδρα 16 January 2012 (has links)
The available information on the WWW is increasing rapidly. This observation has prompted many researchers to focus their work on extracting useful features from web documents, such as pages, images and video, in order to enhance the task of web page classification. A quite informative resource that has not been thoroughly explored for languages other than English is the uniform resource locator (URL). Motivated by the fact that a significant part of Web users are interested in web resources whose URLs contain terms from their non-English native languages, written in Latin characters, we propose a method that identifies and successfully extracts keywords within URLs, focusing on the Greek Web and specifically on URLs containing Greek terms. The main issues for this approach are that Greek words can be transliterated into Latin characters in many different ways, based on how the words are pronounced rather than on how they are written, and that URLs can contain more than one word with no delimiter between them. Although there are previous attempts on similar issues, such as Greek web search and entity recognition in Greek web pages, none of them is based on URLs; and while there are many techniques for web page categorization based mainly on URLs, none explores the case of Greek terms.
The proposed method uses a three-step approach. First, a normalized URL is divided into its basic components according to the URI scheme (scheme://host/path-elements/document.extension), and the domain part is split on the appearance of punctuation marks or numbers. Second, domain tokens are segmented into meaningful tokens using a set of transliteration rules and a Greek dictionary. Finally, to identify useful keywords, a score is assigned to each extracted keyword based on its length and on whether the word is nested in another word. The algorithm is evaluated on a random sample of 1,000 manually collected URLs, in a human-based evaluation comparing the keywords extracted automatically with the keywords extracted manually when no information other than the URL is available. The results look promising.
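
A hedged sketch of the segmentation step only. The thesis's transliteration rules and Greek dictionary are not reproduced in this record; the toy Latinized word list and the greedy longest-match strategy below are assumptions:

```python
# Illustrative greedy segmentation of a Latin-written domain token
# into known (Latinized) Greek words. The real method also applies
# transliteration rules, which are omitted in this toy version.
DICTIONARY = {"nea", "ellada", "podosfairo"}  # hypothetical word list

def segment(token):
    """Longest-match-first split; returns None if no full split exists."""
    if not token:
        return []
    for end in range(len(token), 0, -1):
        if token[:end] in DICTIONARY:
            rest = segment(token[end:])
            if rest is not None:
                return [token[:end]] + rest
    return None

print(segment("neaellada"))  # -> ['nea', 'ellada']
```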
19

Automatisk extraktion av nyckelord ur ett kundforum / Automatic keyword extraction from a customer forum

Ekman, Sara January 2018 (has links)
Conversations in a customer forum span different topics and the language is inconsistent, so the texts do not meet the usual demands placed on material for automatic keyword extraction. This essay examines how keywords can nevertheless be automatically extracted from a customer forum. The study focuses on three aspects of keyword extraction. The first is how the established keyword extraction method TF*IDF performs compared to four methods created with the unusual structure of the material in mind. The second is whether different ways of calculating word frequency affect the result. The third is how the methods perform when they use only the posts, only the titles, or both text types in their extractions. Non-parametric tests were conducted to evaluate the extractions. A number of Friedman's tests show that the methods in some cases differ in their ability to identify relevant keywords. In post-hoc tests between the highest-performing methods, one of the new methods in one case performs significantly better than the other new methods, but not better than TF*IDF. No difference was found between the text types or between the ways of calculating word frequency. For future research, reliability testing of manually annotated keywords is recommended; a larger sample should be used than in the current study, and several suggestions are given for improving the scoring of extracted keywords.
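
A minimal sketch of TF*IDF keyword scoring as used for the baseline here. The tokenization and the exact TF and IDF variants are assumptions; the study itself compares several frequency-counting schemes:

```python
# Illustrative TF*IDF: score each word in a post against the whole
# forum, then keep the top-scoring words as that post's keywords.
import math
from collections import Counter

def tfidf_keywords(posts, top_k=3):
    docs = [p.lower().split() for p in posts]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    keywords = []
    for d in docs:
        tf = Counter(d)
        scores = {w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return keywords

posts = ["my router keeps dropping the wifi connection",
         "how do i reset the router password",
         "billing question about my invoice"]
print(tfidf_keywords(posts))
```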
20

Průzkum vlastnictví kódu ve velké organizaci / Code Ownership Research in Large Organization

Šimonek, Jan January 2015 (has links)
This master's thesis is about code ownership in software projects and about creating a tool that improves cooperation by identifying code owners. The theoretical foundation for code ownership starts from Extreme Programming, explaining models of code ownership and the impact of code ownership on cooperation among teams and team members. The concept is demonstrated on a concrete software firm, where a potential for improvement is identified. That potential is exploited by a software tool designed in this thesis. The tool can identify code owners and experts for specific areas of the code based on data gathered from a version control system, and it makes the resulting information easily accessible. The tool is used to conduct code ownership research on several projects, which allows me to confirm the accuracy of the results. The usability and benefits of the tool are discussed in the final chapter.
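
A hedged sketch of the core idea, identifying likely owners from version control history. The thesis's actual heuristics are not given in this record; counting commits per author with plain `git log` is an assumption:

```python
# Illustrative code-ownership estimate: the authors with the most
# commits touching a path are treated as its likely owners/experts.
import subprocess
from collections import Counter

def owners(repo_dir, path, top_k=3):
    out = subprocess.run(
        ["git", "log", "--pretty=format:%an", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in out.splitlines() if line)
    return counts.most_common(top_k)

# Run inside a git repository; output like [('Alice', 42), ('Bob', 17)]
print(owners(".", "src/"))
```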
