About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
271

Generating Data-Extraction Ontologies By Example

Zhou, Yuanqiu 22 November 2005 (has links) (PDF)
Ontology-based data-extraction is a resilient web data-extraction approach. A major limitation of this approach is that ontology experts must manually develop and maintain data-extraction ontologies. This limitation prevents ordinary users, who have little knowledge of conceptual models, from making use of this resilient approach. In this thesis we have designed and implemented a general framework, OntoByE, to generate data-extraction ontologies semi-automatically from a small set of examples collected by users. With the assistance of a limited amount of prior knowledge, experimental evidence shows that OntoByE is capable of interacting with users to generate data-extraction ontologies for domains of interest to them.
272

Comparative Microarray Data Mining

Mao, Shihong 27 December 2007 (has links)
No description available.
273

An interprocedural framework for data redistributions in distributed memory machines

Krishnamurthy, Sudha January 1996 (has links)
No description available.
274

Data Mining over Hidden Data Sources

Liu, Tantan 24 August 2012 (has links)
No description available.
275

Data mining in real-world traditional Chinese medicine clinical data warehouse

Zhou, X., Liu, B., Zhang, X., Xie, Q., Zhang, R., Wang, Y., Peng, Yonghong January 2014 (has links)
No / The real-world clinical setting is the major arena of traditional Chinese medicine (TCM), which has accumulated long-term practical clinical experience and developed established theoretical knowledge and clinical solutions suited to personalized treatment. Clinical phenotypes are the most important features captured by TCM for diagnosis and treatment, and they are diverse and dynamically changeable in real-world clinical settings. Together with clinical prescriptions containing multiple herbal ingredients, TCM clinical activities embody immensely valuable, high-dimensional data for knowledge distilling and hypothesis generation. In China, with the curation of large-scale real-world clinical data from regular clinical activities, transforming these data into clinically insightful knowledge has increasingly become a hot topic in the TCM field. This chapter introduces the application of data warehouse techniques and data mining approaches for utilizing real-world TCM clinical data, drawn mainly from electronic medical records. The main framework of clinical data mining applications in the TCM field is also introduced, with emphasis on related work in this area. Key points and issues for improving research quality are discussed, and future directions are proposed.
276

Clustering of nonstationary data streams: a survey of fuzzy partitional methods

Abdullatif, Amr R.A., Masulli, F., Rovetta, S. 20 January 2020 (has links)
Yes / Data streams have arisen as a relevant research topic during the past decade. They are real-time, incremental in nature, temporally ordered, massive, contain outliers, and the objects in a data stream may evolve over time (concept drift). Clustering is often one of the earliest and most important steps in the streaming data analysis workflow. A comprehensive literature is available about stream data clustering; however, less attention is devoted to the fuzzy clustering approach, even though the nonstationary nature of many data streams makes it especially appealing. This survey discusses relevant data stream clustering algorithms focusing mainly on fuzzy methods, including their treatment of outliers and concept drift and shift. / Ministero dell'Istruzione, dell'Università e della Ricerca.
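As a hedged illustration of the fuzzy partitional idea the survey covers, here is a minimal sketch of the classic fuzzy c-means update steps (a textbook baseline, not any specific streaming algorithm from the survey); the function names and the NumPy formulation are illustrative assumptions:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership matrix U (n points x c clusters).

    u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)), so each row sums to 1.
    """
    # distances from each point to each center, shape (n, c)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # guard against division by zero
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

def update_centers(X, U, m=2.0):
    """Recompute centers as membership-weighted means of the points."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]
```

A streaming variant would apply such updates incrementally, typically with a forgetting factor so old points fade out as the distribution drifts; that extension is exactly where the surveyed methods differ.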
277

Data Sharing and Retrieval of Manufacturing Processes

Seth, Avi 28 March 2023 (has links)
With the Industrial Internet, businesses can pool their resources to acquire large amounts of data that can then be used in machine learning tasks. Despite the potential to speed up training and deployment and improve decision-making through data sharing, rising privacy concerns are slowing the spread of such technologies. As businesses are naturally protective of their data, this poses a barrier to interoperability. While previous research has focused on privacy-preserving methods, existing works typically consider data that is averaged or randomly sampled by all contributors rather than selecting data that is best suited for a specific downstream learning task. In response to the dearth of efficient data-sharing methods for diverse machine learning tasks in the Industrial Internet, this work presents an end-to-end working demonstration of a search engine prototype built on PriED, a task-driven data-sharing approach that enhances the performance of supervised learning by judiciously fusing shared and local participant data. / Master of Science / My work focuses on PriED, a data-sharing framework that enhances machine learning performance while also preserving user data privacy. In particular, I have built a working demonstration of a search engine that leverages the PriED framework and allows users to collaborate with their data without compromising their data privacy.
278

A Framework for Hadoop Based Digital Libraries of Tweets

Bock, Matthew 17 July 2017 (has links)
The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL's collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools (Spark, Scala, HBase, etc.), simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. To represent this data, several data structures were designed to operate on top of the Hadoop and Spark libraries of tools. The first set of data structures is an abstract representation of a tweet at a basic level, as well as several concrete implementations which represent varying levels of detail to correspond with common sources of tweet data. The second major data structure is a collection structure designed to represent collections of tweet data structures and provide ways to filter, clean, and process the collections. All of these data structures went through an iterative design process based on the needs of the developers. The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL's archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection's topic model. 
The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies. / Master of Science / The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL’s collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools, simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a programming framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient. The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL’s archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection’s topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. 
The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.
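The layered design this abstract describes (an abstract tweet type, concrete variants with more detail, and a collection type offering filter/clean/process operations) can be sketched in miniature as below; all class and method names are illustrative assumptions, not the actual DLRL framework API, and the real framework sits on top of Hadoop/Spark rather than plain Python lists:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Tweet:
    """Abstract baseline: id and text are the common denominator of all sources."""
    tweet_id: str
    text: str


@dataclass
class DetailedTweet(Tweet):
    """Concrete variant for sources that also carry user and URL metadata."""
    user: Optional[str] = None
    created_at: Optional[str] = None
    urls: List[str] = field(default_factory=list)


class TweetCollection:
    """Wraps a list of tweets; filtering and cleaning return new collections."""

    def __init__(self, tweets: List[Tweet]):
        self.tweets = list(tweets)

    def filter(self, pred: Callable[[Tweet], bool]) -> "TweetCollection":
        return TweetCollection([t for t in self.tweets if pred(t)])

    def clean(self) -> "TweetCollection":
        # toy cleaning pass: drop empty texts and retweet-prefixed duplicates
        kept = [t for t in self.tweets if t.text and not t.text.startswith("RT ")]
        return TweetCollection(kept)
```

In the framework proper, the collection operations would be backed by Spark transformations over HBase-stored tweets, so the same high-level calls scale to billion-tweet collections.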
279

Rise and Pitfalls of Synthetic Data for Abusive Language Detection

Casula, Camilla 28 October 2024 (has links)
Synthetic data has been proposed as a method to potentially mitigate a number of issues with existing models and datasets for abusive language detection online, such as negative psychological impact on annotators, privacy issues, dataset obsolescence, and representation bias. However, previous work on the topic has mostly focused on the downstream task performance of models, without paying much attention to the evaluation of other aspects. In this thesis, we carry out a series of experiments and analyses on synthetic data for abusive language detection going beyond performance, with the goal of assessing both the potential and the pitfalls of synthetic data from a qualitative point of view. More specifically, we study synthetic data for abusive language detection in English focusing on four aspects: robustness, examining the ability of models trained on synthetic data to generalize to out-of-distribution scenarios; fairness, with an exploration of the representation of identity groups; privacy, exploring the use of entirely synthetic datasets to avoid sharing user-generated data; and finally, quality, through a manual annotation and analysis of how realistic and representative of real data synthetic data can be with regard to abusive language.
280

Big data, data mining, and machine learning: value creation for business leaders and practitioners

Dean, J. January 2014 (has links)
No / Big data is big business. But having the data and the computational power to process it isn't nearly enough to produce meaningful results. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners is a complete resource for technology and marketing executives looking to cut through the hype and produce real results that hit the bottom line. Providing an engaging, thorough overview of the current state of big data analytics and the growing trend toward high performance computing architectures, the book is a detail-driven look into how big data analytics can be leveraged to foster positive change and drive efficiency. With continued exponential growth in data and ever more competitive markets, businesses must adapt quickly to gain every competitive advantage available. Big data analytics can serve as the linchpin for initiatives that drive business, but only if the underlying technology and analysis is fully understood and appreciated by engaged stakeholders.