About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
621

Interactive pattern mining of neuroscience data

Waranashiwar, Shruti Dilip 29 January 2014 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Text mining is the extraction of knowledge from unstructured text documents. Huge volumes of text documents exist in digital form, and it is impossible to extract knowledge from them manually. Text mining is therefore used to find useful information in text through the identification and exploration of interesting patterns. The objective of this thesis is to find compact but high-quality frequent patterns in text documents from the field of neuroscience. We aim to show that an interactive sampling algorithm is more time-efficient than exhaustive methods such as FP-Growth in the RapidMiner tool. Instead of mining all frequent patterns, many of which may not interest the user, interactively mining only the desired and interesting patterns makes far better use of resources; this is especially noticeable with large numbers of keywords. In interactive pattern mining, the user gives feedback on whether a pattern is interesting or not. Frequent patterns are then generated interactively using a Markov Chain Monte Carlo (MCMC) sampling method. The thesis discusses the interactive extraction of patterns between keywords related to some common neurological disorders, using the PubMed database and keywords related to schizophrenia and alcoholism as inputs. It reveals many associations between terms that would otherwise be difficult to discover by reading articles or journals manually. The Graphviz tool is used to visualize the associations.
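The interactive MCMC loop described above can be sketched in a few lines. This is an illustrative toy, not the thesis's actual algorithm: the function names, the acceptance rule, and the feedback representation are all assumptions. A random walk over itemsets toggles one item per step and accepts moves in proportion to support weighted by user-supplied interest scores.

```python
import random

def support(pattern, transactions):
    """Fraction of transactions containing every item in the pattern."""
    if not pattern:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)

def sample_patterns(transactions, items, steps=500, feedback=None, seed=0):
    """Metropolis-Hastings-style walk over itemsets: propose toggling one
    item, accept roughly in proportion to (support * user interest)."""
    rng = random.Random(seed)
    feedback = feedback or {}            # user feedback: pattern -> weight

    def score(p):
        return support(p, transactions) * feedback.get(frozenset(p), 1.0)

    current = frozenset([rng.choice(items)])
    visited = {}
    for _ in range(steps):
        item = rng.choice(items)
        proposal = current ^ {item}      # toggle one item in or out
        if proposal and rng.random() < min(1.0, score(proposal) / max(score(current), 1e-9)):
            current = frozenset(proposal)
        visited[current] = visited.get(current, 0) + 1
    # patterns visited most often are the ones the chain considers interesting
    return sorted(visited, key=visited.get, reverse=True)
```

In the real setting, the `feedback` weights would be updated between sampling rounds as the user marks patterns as interesting or not, biasing the chain toward the regions of the pattern space the user cares about.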
622

Stylometry: Quantifying Classic Literature for Authorship Attribution: A Machine Learning Approach

Yousif, Jacob, Scarano, Donato January 2024 (has links)
Classic literature is rich linguistically, historically, and culturally, making it valuable for future studies. This project therefore chose a set of 48 classic books and conducted a stylometric analysis on them, adopting an approach from related work: dividing the books into text segments, quantifying the resulting segments, and analyzing the quantified values to understand the linguistic attributes of the books. Beyond this analysis, the project conducted classification tasks with two further objectives. First, the study used the quantified values of the text segments in classification tasks with advanced models such as LightGBM and TabNet to assess this approach to authorship attribution. Second, the study applied a state-of-the-art model, RoBERTa, to classification tasks on the segmented texts themselves to evaluate its performance in authorship attribution. The results uncovered the characteristics of the books to a reasonable degree. Regarding authorship attribution, the results suggest that segmenting and quantifying text using stylometric analysis and supervised machine learning algorithms is practical for such tasks, although the approach may still require further improvements to achieve optimal performance. Lastly, RoBERTa demonstrated high performance in authorship attribution tasks.
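The segment-and-quantify step can be illustrated with a minimal sketch. The feature set and function names here are assumptions, not the features actually used in the thesis: split a book into fixed-length word segments and compute a few classic stylometric measures per segment, which can then be fed to any tabular classifier.

```python
def segment(text, size=50):
    """Split a text into consecutive segments of `size` words each;
    a trailing fragment shorter than `size` is dropped."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - size + 1, size)]

def stylometric_features(segment_text):
    """Quantify a segment with three classic stylometric measures:
    average word length, type-token ratio, average sentence length."""
    words = segment_text.split()
    sentences = [s for s in segment_text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return (
        sum(len(w) for w in words) / len(words),           # avg word length
        len(set(w.lower() for w in words)) / len(words),   # type-token ratio
        len(words) / max(len(sentences), 1),               # avg sentence length
    )
```

Each segment becomes one row of numeric features labeled with its author, which is the tabular form that models such as LightGBM or TabNet consume.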
623

Zoetrope – Interactive Feature Exploration in News Videos

Liebl, Bernhard, Burghardt, Manuel 11 July 2024 (has links)
No description available.
624

文字背後的意含-資訊的量化測量公司基本面與股價(以中鋼為例) / Behind the words - quantifying information to measure firms' fundamentals and stock return (taking the China steel corporation as example)

傅奇珅, Fu, Chi Shen Unknown Date (has links)
This study collects all news stories about China Steel Corporation from the Economic Daily News, the United Daily News, and the United Evening News. The articles are segmented with the Chinese Word Segmentation System of Academia Sinica and processed following the methodology of Tetlock, Saar-Tsechansky, and Macskassy (2008), extended to examine whether a simple quantitative measure of language can be used to explain and predict an individual firm's accounting sales and stock returns. The two main findings are: 1. the fraction of positive (commendatory) words in firm-specific news stories forecasts high firm sales; 2. the firm's stock price briefly overreacts to the information embedded in negative (derogatory) words, while it efficiently incorporates the information embedded in positive (commendatory) words. The thesis concludes that linguistic media content captures otherwise hard-to-quantify aspects of a firm's fundamentals, which investors quickly incorporate into stock prices.
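The tone measure at the heart of these findings can be sketched as follows. The tiny lexicons below are placeholders, not the word lists actually used in the study: the measure is simply the fraction of positive and negative lexicon words per news story.

```python
# Illustrative stand-in lexicons; the study uses full commendatory /
# derogatory word lists, not these few examples.
POSITIVE = {"growth", "profit", "record", "strong", "gain"}
NEGATIVE = {"loss", "decline", "weak", "lawsuit", "drop"}

def tone_fractions(story):
    """Fractions of positive and negative lexicon words in a story --
    the quantities regressed against firm sales and stock returns."""
    words = [w.strip(".,").lower() for w in story.split()]
    n = len(words)
    pos = sum(w in POSITIVE for w in words) / n
    neg = sum(w in NEGATIVE for w in words) / n
    return pos, neg
```

For each firm-day, the positive and negative fractions of that day's stories become the explanatory variables in regressions of subsequent accounting sales and stock returns.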
625

Semi-automated Ontology Generation for Biocuration and Semantic Search

Wächter, Thomas 01 February 2011 (has links) (PDF)
Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. 
Definitions can be retrieved for up to 78% of terms and parent-child relations for up to 54%. No other validated system achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands the request using the structure and terminology of the ontology. The machine classification employed in Go3R distinguishes documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
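The idea of generating terms from statistically significant phrases can be approximated with a small sketch. A simple frequency log-odds score stands in for DOG4DAG's actual statistical test, and bigrams stand in for parsed noun phrases; names and data are illustrative assumptions. Candidate phrases that are unusually frequent in domain text relative to a background corpus are promoted as terms.

```python
from collections import Counter
import math

def candidate_terms(domain_docs, background_docs, top=5):
    """Rank word bigrams by a log-odds score of domain frequency against
    background frequency -- a toy stand-in for statistical significance
    testing of noun phrases in term generation."""
    def bigrams(docs):
        counts = Counter()
        for d in docs:
            w = d.lower().split()
            counts.update(zip(w, w[1:]))
        return counts

    dom, bg = bigrams(domain_docs), bigrams(background_docs)
    n_dom, n_bg = sum(dom.values()) or 1, sum(bg.values()) or 1

    def score(b):
        # +1 smoothing so bigrams unseen in the background still score
        return math.log((dom[b] / n_dom) / ((bg[b] + 1) / n_bg))

    return [" ".join(b) for b in sorted(dom, key=score, reverse=True)[:top]]
```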
626

Tuning of machine learning algorithms for automatic bug assignment

Artchounin, Daniel January 2017 (has links)
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). Partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method for finding some of the best configurations of several machine learning algorithms for the automatic bug assignment problem. The four steps are used, respectively, to select a combination of pre-processing techniques, a bug report representation, and a potential feature selection technique, and to tune several classifiers. The method has been applied to three software projects: 66,066 bug reports of a proprietary project, 24,450 bug reports of Eclipse JDT, and 30,358 bug reports of Mozilla Firefox. 619 configurations were applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased accuracy by up to 16.64 percentage points.
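The outer shape of such a tuning method is an exhaustive sweep over a configuration grid. This is a minimal, generic sketch under stated assumptions, not the thesis's code: the callback signatures are invented, and each grid axis stands in for one of the four steps (pre-processing choices, report representation, feature selection, classifier parameters).

```python
from itertools import product

def grid_search(train_fn, eval_fn, grid):
    """Evaluate every combination in a configuration grid and keep the
    best-scoring one. `train_fn(cfg)` builds a model for a configuration;
    `eval_fn(model, cfg)` returns its validation score."""
    best_score, best_cfg = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        model = train_fn(cfg)
        score = eval_fn(model, cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

With several values per axis across four steps, the number of combinations multiplies quickly, which is consistent with the 619 configurations compared per project in the thesis.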
627

Aural Mapping of STEM Concepts Using Literature Mining

Bharadwaj, Venkatesh 06 March 2013 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Recent technological applications have made people's lives heavily dependent on Science, Technology, Engineering, and Mathematics (STEM). Understanding basic science is a must in order to use and contribute to this technological revolution. Science education at the middle and high school levels, however, depends heavily on visual representations such as models, diagrams, figures, animations, and presentations. This leaves visually impaired students with very few options to learn science and secure a career in STEM-related areas. Recent experiments have shown that small aural cues called audemes, which are non-verbal sound translations of a science concept, help visually impaired students understand and memorize science concepts. To make science concepts available as audemes for visually impaired students, this thesis presents an automatic system for audeme generation from STEM textbooks. It describes the systematic application of multiple Natural Language Processing tools and techniques, such as a dependency parser, a POS tagger, an information retrieval algorithm, semantic mapping of aural words, and machine learning, to transform a science concept into a combination of atomic sounds, thus forming an audeme. We present a rule-based classification method for all STEM-related concepts, as well as a novel way of mapping and extracting the sounds most related to the words used in a textbook. Additionally, machine learning methods are used to customize the output to a user's perception. The system is robust, scalable, fully automatic, and dynamically adaptable for audeme generation.
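The core word-to-sound mapping step can be sketched minimally. The stopword list, sound bank, and function name below are all illustrative assumptions; the real system uses dependency parsing, POS tagging, and semantic mapping rather than direct dictionary lookup.

```python
def generate_audeme(concept, sound_bank):
    """Map each content word of a science concept to an atomic sound in
    a sound bank; function words and unmatched words are skipped. The
    resulting sound sequence is the audeme for the concept."""
    STOPWORDS = {"the", "a", "an", "of", "is", "and"}
    words = [w.lower().strip(".,") for w in concept.split()]
    return [sound_bank[w] for w in words if w not in STOPWORDS and w in sound_bank]
```

In the full pipeline, the lookup would be replaced by semantic mapping, so that a word with no direct entry in the sound bank could still be matched to the most related available sound.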
628

Mining of Textual Data from the Web for Speech Recognition

Kubalík, Jakub January 2010 (has links)
The initial goal of this project was to study language modelling for speech recognition and techniques for obtaining text data from the Web. The text introduces the basic techniques of speech recognition and describes in more detail language models based on statistical methods. In particular, the work deals with criteria for evaluating the quality of language models and of speech recognition systems. The text further describes models and techniques of data mining, especially information retrieval. Problems connected with obtaining data from the web are then presented, and the Google search engine is introduced in contrast to them. Part of the project was the design and implementation of a system for obtaining text from the web, which is described in detail. The main goal of the work, however, was to verify whether data obtained from the Web can be of any benefit to speech recognition. The described techniques therefore seek the optimal way to use data obtained from the Web to improve both example language models and models deployed in real recognition systems.
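The statistical language models and quality criteria mentioned above can be illustrated with a tiny add-k smoothed bigram model and its perplexity, the standard evaluation measure for language models. This is an illustrative sketch, not the project's implementation.

```python
import math
from collections import Counter

def bigram_lm(corpus_words, k=1.0):
    """Estimate an add-k smoothed bigram language model from a word list;
    returns a function prob(prev, word)."""
    unigrams = Counter(corpus_words)
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    vocab = len(set(corpus_words))

    def prob(prev, word):
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab)

    return prob

def perplexity(prob, words):
    """Perplexity of a word sequence under the bigram model -- lower means
    the model predicts the text better."""
    logp = sum(math.log(prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-logp / (len(words) - 1))
```

Comparing perplexity on held-out transcripts before and after adding Web-harvested text to the training corpus is one way to measure whether the Web data actually helps the recognizer's language model.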
629

Jak vytvořit samostatně motivované vzdělávání: Případová studie Coursera & Khan Academy 2014 / How to Create Self-Driven Education: The Social Web & Social Sciences, Coursera & Khan Academy 2014 Case Study

Růžička, Jakub January 2015 (has links)
This diploma thesis is concerned with the possibilities of employing social web data in the social sciences. Its theoretical part describes the changes in education in the context of the dynamics of contemporary society along three fundamental, interrelated dimensions: technology (the cause and/or the tool of the change), work (new models of collaboration), and economics (the sustainability of free and open-source business models). The main methodological part of the thesis focuses on the issues of sampling, sample representativeness, validity and reliability assessment, ethics, and data collection in the emerging field of social web research in the social sciences. The research part includes illustrative social web analyses and conclusions of the author's 2014 Coursera & Khan Academy on the Social Web research, and provides the full research report in its attachment. Its results are compared to the theoretical part in order to provide a "naive" answer, derived from social web mentions and networks, to the fundamental question: "How to Create Self-Driven Education?"
630

Semi-automated Ontology Generation for Biocuration and Semantic Search

Wächter, Thomas 27 October 2010 (has links)
