Global ETD Search

11	Um data warehouse de publicações científicas: indexação automática da dimensão tópicos de pesquisa dos data marts / A Data warehouse for scientific publications: automatic indexing of the research topic dimension for using in data marts
Augusto Kanashiro, 04 May 2007
This master's dissertation is part of the project to build an Intelligent Research Support Tool (Ferramenta Inteligente de Apoio à Pesquisa, FIP), under development at the Laboratory of Computational Intelligence at ICMC-USP. The tool was proposed to retrieve, organize, and mine large sets of scientific documents in the field of computer science. In this context, FIP needs a repository of articles, i.e., a data warehouse that stores and integrates all the information extracted from documents retrieved from personal and institutional web pages and from article repositories on the Web. Storing the data appropriately is important for supporting online analytical processing (OLAP) and for facilitating data mining. The main goal of this MSc research was therefore to design a data warehouse (DW) for FIP. In addition, experiments were carried out with data mining and machine learning techniques to automate the indexing of the information and documents stored in the data warehouse (topic detection). Data marts were built for multidimensional queries so that researchers can evaluate trends in research topics and how those topics evolve.
Keywords: Machine learning; Data mart; Data warehouse; Research topic detection; Data mining; Text mining; OLAP
12	Topic-Based Aggregation of Questions in Social Media
Muthmann, Klemens, January 2013
Software produced by big companies such as SAP is often feature-rich and very expensive, and thus affordable only to other big companies. It usually takes months and specially trained consultants to install and manage such software. However, as vendors move into other market segments made up of smaller companies, different requirements arise. Small and medium-sized companies cannot spend as much money on business software solutions as big companies do. In particular, they cannot afford to hire expensive consultants. On the other hand, it is not economical for the vendor to provide the personnel free of charge. One solution to this dilemma is to bundle all customer support cases on dedicated Web platforms, such as customer support forums. SAP, for example, has the SAP Community Network. This has the additional benefit that customers may help each other. (...)
info:eu-repo/classification/ddc/330 ddc:330
13	巨量資料環境下之新聞主題暨輿情與股價關係之研究 / A Study of the Relevance between News Topics & Public Opinion and Stock Prices in Big Data
張良杰, Chang, Liang Chieh, Unknown Date
In recent years, advances in technology, networks, and storage media have led to explosive growth in the amount of data generated, ushering in the era of big data. With big data, researchers no longer have to rely on traditional sampling to collect data, and analysis is no longer limited by data that are too sparse to represent the population. Once these traditional limits are removed, the essence of big data lies in finding the valuable information it contains. For example, social networking sites (SNS) hold a large amount of public opinion and interpersonal interaction data, and scholars have found that sentiment on SNS is positively correlated with stock prices. This thesis therefore turns to online news, which shares the characteristics of big data. A web crawler collected a total of 30,879 economic news articles published by the Central News Agency between July 2013 and May 2014, and topic detection and tracking and sentiment analysis were applied to them. Based on the similarity between news events, the articles were linked into a network, and the relationship between news sentiment and the stock price index was analyzed.
The results show that news events can be linked into specific news topics, that different news topics can be identified within a large network, and that linking news topics together yields a news topic context. The thesis thus provides a new way to quickly understand a huge volume of news content and to trace news topics and news events back effectively. Regarding news sentiment and the stock price index, the study finds that news sentiment affects fluctuations in the stock price index, with a correlation coefficient of 0.733562. A comparison with the psychological line and trading-willingness indicators suggests that news sentiment outperforms both and can serve, to some degree, as a reference for judging stock prices.
Keywords: Big data; Text mining; News topic detection and tracking; Link analysis; Sentiment analysis
14	Appariement de contenus textuels dans le domaine de la presse en ligne : développement et adaptation d'un système de recherche d'information / Pairing textual content in the field of on-line news : development and adaptation of an information retrieval system
Désoyer, Adèle, 27 November 2017
The goal of this thesis, conducted in an industrial setting, is to pair textual media content. Specifically, the aim is to pair online news articles with relevant videos for which a textual description is available. The problem is therefore exclusively one of textual analysis and involves no image or speech analysis. The questions that arise are how to compare textual objects and which criteria to use to estimate their degree of similarity. We consider one of these criteria to be the topical similarity of their content: two documents must deal with the same topic to form a relevant pair. These issues fall within the field of information retrieval (IR), in which this research is mainly grounded. Furthermore, when dealing with news content, the time dimension is of prime importance, and the issues surrounding it relate to work on topic detection and tracking (TDT), on which the thesis also draws.
The pairing system developed in this thesis comprises several complementary steps. First, natural language processing (NLP) methods are used to index both articles and videos, going beyond the traditional bag-of-words representation of texts. Second, two scores are calculated for each article-video pair: one reflects their topical similarity and is based on an IR vector space model; the other expresses their proximity in time and is based on an empirical function. Finally, a classification model learned from manually annotated document pairs, each described by these two scores, is used to rank the results.
Evaluating the system's performance also raised questions in this doctoral research. The constraints imposed by the data and the specific needs of the partner company led us to adopt an alternative to the classic IR evaluation protocol, the Cranfield paradigm. We therefore propose an evaluation approach that takes all of these constraints into account.
Keywords: Information retrieval system; Topic detection and tracking; Content-based recommendation; Supervised learning; Evaluation framework; Industrial context
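The two-score pairing idea summarized in entry 14 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract only states that topical similarity comes from a vector space model and that temporal proximity comes from an empirical function. Everything else here is an assumption made for illustration, including raw term-frequency vectors, the exponential decay with a hypothetical `half_life_days` parameter, and the fixed weights `w_topic` and `w_time`, which stand in for the learned classification model.

```python
import math
from collections import Counter
from datetime import datetime


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_proximity(date_a: datetime, date_b: datetime,
                       half_life_days: float = 3.0) -> float:
    """Assumed decay function: 1.0 for identical dates, halved every half_life_days."""
    gap_days = abs((date_a - date_b).total_seconds()) / 86400.0
    return 0.5 ** (gap_days / half_life_days)


def pair_score(article: dict, video: dict,
               w_topic: float = 0.7, w_time: float = 0.3) -> float:
    """Hypothetical fixed-weight combination standing in for a learned model."""
    topic = cosine_similarity(article["text"], video["text"])
    time = temporal_proximity(article["date"], video["date"])
    return w_topic * topic + w_time * time


def rank_videos(article: dict, videos: list) -> list:
    """Order candidate videos from most to least relevant for one article."""
    return sorted(videos, key=lambda v: pair_score(article, v), reverse=True)
```

In a supervised setting such as the one the abstract describes, the two scores would instead be fed as features to a classifier trained on annotated pairs, and the classifier's output would drive the ranking.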
15	Vers des moteurs de recherche "intelligents" : un outil de détection automatique de thèmes : méthode basée sur l'identification automatique des chaînes de référence / Toward "intelligent" search engines : an automatic topic detection tool : method based on automatic reference chains identification
Longo, Laurence, 12 December 2013
This thesis, in the field of Natural Language Processing, aims to optimize document classification in search engines. The work focuses on the development of a tool that automatically detects the topics of documents (ATDS-fr). Relying on little prior knowledge, the hybrid method combines statistical techniques for topic segmentation with linguistic methods that identify cohesion markers. Among these markers, reference chains, i.e. sequences of referential expressions referring to the same discourse entity (e.g. Paul ... he ... this man), received special attention because they are important textual cues for topic detection (i.e. they mark topic introduction, maintenance, and change). Thus, starting from a study of reference chains in a corpus drawn from various textual genres (political analyses, public reports, European laws, editorials, and a novel), we developed RefGen, an automatic reference chain identification module, which was evaluated using current coreference metrics.
Keywords: Topic detection; Reference chains; Natural language processing; Lexical semantics; Coreference; Textual genre; Topic segmentation; Linguistic markers; Cohesion; Corpus linguistics
401.4 004.678
