21.
Data mining časových řad / Time series data mining. Novák, Petr. January 2009.
This work deals with modern trends in time series data mining
22.
Mining Social Tags to Predict Mashup Patterns. El-Goarany, Khaled. 11 November 2010.
In this thesis, a tag-based approach is proposed for predicting mashup patterns, deriving inspiration for potential new mashups from the community's consensus. The approach applies association rule mining techniques to discover relationships between APIs and mashups based on their annotated tags. The mined relationships are advocated as a valuable source for recommending mashup candidates while mitigating common problems in recommender systems. The proposed methodology is evaluated through experiments on a real-life dataset. Results show that the proposed mining approach improves prediction accuracy by 60% in precision and 79% in recall over a direct string-matching approach that lacks the mined information. / Master of Science
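To make the mining step concrete, here is a minimal sketch (not the thesis's implementation) of association rule mining over tag transactions: each mashup is treated as a set of tags, frequent tag combinations are counted Apriori-style, and high-confidence rules become candidate mashup patterns. All tags, data, and thresholds below are invented for illustration.

```python
# A minimal Apriori-style sketch over tag transactions; data is invented.
from collections import Counter
from itertools import combinations

mashups = [                              # one set of annotated tags per mashup
    {"mapping", "geo", "photo"},
    {"mapping", "geo", "search"},
    {"photo", "social"},
    {"mapping", "geo", "photo", "social"},
]
min_support, min_confidence = 0.5, 0.6
n = len(mashups)

# Support counts for all 1- and 2-tag itemsets (the first Apriori levels).
support = Counter()
for tags in mashups:
    for k in (1, 2):
        support.update(combinations(sorted(tags), k))

frequent = {s: c / n for s, c in support.items() if c / n >= min_support}

# Derive rules {a} -> {b} from frequent pairs: confidence = supp(a,b) / supp(a).
for pair, supp_ab in frequent.items():
    if len(pair) != 2:
        continue
    a, b = pair
    for ante, cons in ((a, b), (b, a)):
        confidence = supp_ab / frequent[(ante,)]   # subsets of frequent sets are frequent
        if confidence >= min_confidence:
            print(f"{{{ante}}} -> {{{cons}}}  "
                  f"support={supp_ab:.2f}  confidence={confidence:.2f}")
```

A rule such as {geo} -> {mapping} would then suggest that an API tagged geo is a promising partner for mapping APIs in a new mashup.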
23.
Securing Cyberspace: Analyzing Cybercriminal Communities through Web and Text Mining Perspectives. Benjamin, Victor. January 2016.
Cybersecurity has become one of the most pressing issues facing society today. In particular, cybercriminals often congregate within online communities to exchange knowledge and assets. As a result, there has been a strong interest in recent years in developing a deeper understanding of cybercriminal behaviors, the global cybercriminal supply chain, emerging threats, and various other cybersecurity-related activities. However, few works in recent years have focused on identifying, collecting, and analyzing cybercriminal content. Despite the high societal impact of cybercriminal community research, only a few studies have leveraged these rich data sources in their totality, and those that do often resort to manual data collection and analysis techniques. In this dissertation, I address two broad research questions: 1) In what ways can I advance cybersecurity as a science by scrutinizing the contents of online cybercriminal communities? and 2) How can I make use of computational methodologies to identify, collect, and analyze cybercriminal communities in an automated and scalable manner? To these ends, the dissertation comprises four essays. The first essay introduces a set of computational methodologies and research guidelines for conducting cybercriminal community research. To this point, there has been no literature establishing a clear route for non-technical and non-security researchers to begin studying such communities. The second essay examines possible motives for prolonged participation by individuals within cybercriminal communities. The third essay develops new neural network language model (NNLM) capabilities and applies them to cybercriminal community data in order to understand hacker-specific language evolution and to identify emerging threats. The last essay focuses on developing a NNLM-based framework for identifying information dissemination among varying international cybercriminal populations by examining multilingual cybercriminal forums. These essays help further establish cybersecurity as a science.
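As a rough illustration of the language-evolution idea in the third essay (using word2vec as a simpler stand-in for the dissertation's NNLM work), one can train a separate embedding model per time slice of forum posts and compare a term's nearest neighbors across slices; a shift in neighbors signals a shift in usage. The toy corpora and the term tracked below are invented.

```python
# Illustrative sketch only, not the dissertation's NNLM: compare a term's
# embedding neighborhood across two time slices of (invented) forum posts.
from gensim.models import Word2Vec

posts_2012 = [["selling", "crypter", "fud", "bypass", "antivirus"],
              ["crypter", "stub", "bypass", "detection"]] * 50
posts_2015 = [["crypter", "service", "monthly", "subscription"],
              ["renting", "crypter", "service", "support"]] * 50

def neighbors(corpus, term):
    model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                     min_count=1, epochs=30, seed=1, workers=1)
    return [w for w, _ in model.wv.most_similar(term, topn=3)]

# A drift in neighbors suggests the term's usage (and the underlying
# criminal offering) has evolved between the two periods.
print("2012:", neighbors(posts_2012, "crypter"))
print("2015:", neighbors(posts_2015, "crypter"))
```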
24.
Query log mining in search engines. Mendoza Rocha, Marcelo Gabriel. January 2007.
Doctor en Ciencias, Mención Computación / The Web is a vast information space where many resources such as documents, images, and other multimedia content can be accessed. In this context, various information technologies have been developed to help users satisfy their search needs on the Web, the most widely used of which are search engines. Search engines let users find resources by formulating queries and reviewing a list of answers.
One of the main challenges for the Web community is to design search engines that allow users to find resources semantically connected to their queries. The enormous size of the Web and the vagueness of the terms most commonly used in formulating queries are major obstacles to achieving this goal.
In this thesis we propose exploring the user selections recorded in search engine logs, both to learn how users search and to design algorithms that improve the precision of the answers recommended to users. We begin by exploring the properties of these data; this exploration reveals their sparse nature. We also present models that help us understand how users search in search engines.
Next, we mine user selections to find useful associations between queries recorded in the logs, concentrating our efforts on designing techniques that allow users to find better queries than their original one. As an application, we design query reformulation methods that help users find more useful terms, improving the representation of their needs.
Using document terms, we build vector representations of queries. By applying clustering techniques we can determine groups of similar queries. Using these query groups, we introduce query and document recommendation methods that improve the precision of the recommendations.
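A minimal sketch of this clustering step might look as follows, assuming (as proposed here) that each query is represented by terms from the documents its users selected; the queries and clicked-document text are invented for illustration.

```python
# Sketch: cluster queries via TF-IDF vectors built from clicked-document terms.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

clicked_text = {   # query -> terms of its clicked documents (toy data)
    "car rental": "rent car vehicle airport daily rate",
    "auto hire": "rent vehicle car hire insurance rate",
    "cheap flights": "airline ticket fare flight deal booking",
    "plane tickets": "flight airline booking fare ticket",
}

queries = list(clicked_text)
X = TfidfVectorizer().fit_transform(clicked_text.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Queries sharing a cluster are candidate recommendations for each other.
for query, label in sorted(zip(queries, labels), key=lambda p: p[1]):
    print(label, query)
```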
Finally, we design query classification techniques that find concepts semantically related to the original query. To achieve this, we classify user queries into Web directories. As an application, we introduce methods for the automatic maintenance of those directories.
25.
Diseño y desarrollo de un módulo de clasificación de páginas Web en base a las características de su contenido utilizando técnicas de minería de datos / Design and development of a web page classification module based on content features using data mining techniques. Falloux Costa, Gonzalo Alejandro. January 2016.
Ingeniero Civil Industrial / The main objective of this work is to design and develop a web page classification module based on the features of page content, using data mining techniques. Concretely, this means using HTML content, an analysis of the page's visible text, and a variable reflecting web security according to SSL as predictive variables for classifying web pages.
The work is carried out within the AKORI project of the Web Intelligence Centre at the Faculty of Mathematical Sciences of the Universidad de Chile, which aims to develop a computational platform for improving the design and content of websites through the study of physiological variables and the application of data mining. The platform consists of the implementation of a model capable of predicting both eye-fixation and pupil-dilation maps quickly and accurately.
At this stage of the AKORI project it is necessary to improve the performance of these predictions, which are made on real websites of highly varied design and content. Moreover, the behaviour to be predicted is that of users whose motivation for browsing is unknown, which in turn alters both their ocular behaviour and their navigation patterns.
Accordingly, the research hypothesis is: web pages can be classified by the features of their content, addressing two fundamental problems. First, classification groups web pages so as to maximize the variance between classes and minimize the variance within classes, which should considerably improve the model's performance, since predicting within a class whose examples are more similar narrows the range of error and thus reduces the standard error of the prediction. Second, if the service a web page offers is known, classification provides information about the user's motivation on the web; while this is not complete information for describing user behaviour, it can be an important supporting variable.
To develop the model, a dataset of 138 web pages is used, chosen according to traffic from Chilean users, and five data mining algorithms are implemented to classify pages into seven classes. The Naive Bayes algorithm achieves the best performance, with an accuracy of 78.67%, which validates the research hypothesis.
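A much-simplified sketch of this kind of classifier is shown below: TF-IDF features from the visible text combined with a binary SSL indicator, fed to a Multinomial Naive Bayes model (scikit-learn here is my stand-in choice, not necessarily the thesis's tooling). The pages, classes, and feature values are invented; the actual work used 138 real pages, richer HTML-derived features, and seven classes.

```python
# Sketch: text features + an SSL flag feeding Naive Bayes; data is invented.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    ("checkout cart credit card secure payment", 1, "ecommerce"),
    ("breaking news politics headline report",   0, "news"),
    ("buy shipping discount order payment",      1, "ecommerce"),
    ("article journalist coverage headline",     0, "news"),
]
texts  = [p[0] for p in pages]
ssl    = [[p[1]] for p in pages]          # 1 if the page is served over SSL
labels = [p[2] for p in pages]

vec = TfidfVectorizer()
X = hstack([vec.fit_transform(texts), csr_matrix(ssl)])
clf = MultinomialNB().fit(X, labels)

x_new = hstack([vec.transform(["order now secure payment"]), csr_matrix([[1]])])
print(clf.predict(x_new))                 # -> ['ecommerce']
```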
Finally, it is concluded that all expected results are met and the research hypothesis is validated, with satisfactory results given the current state of the research.
26.
Profiling topics on the Web for knowledge discovery. Sehgal, Aditya Kumar. 01 January 2007.
The availability of large-scale data on the Web motivates the development of automatic algorithms to analyze topics and to identify relationships between topics. Various approaches have been proposed in the literature. Most focus on specific topics, mainly those representing people, with little attention to topics of other kinds. They are also less flexible in how they represent topics.
In this thesis we study existing methods as well as describe a different approach, based on profiles, for representing topics. A topic profile is analogous to a synopsis of a topic and consists of different types of features. Profiles are flexible, allowing different combinations of features to be emphasized, and extensible, allowing new features to be incorporated without changing the underlying logic.
More generally, topic profiles provide an abstract framework that can be used to create different types of concrete representations for topics. Different options regarding the number of documents considered for a topic or types of features extracted can be decided based on requirements of the problem as well as the characteristics of the data. Topic profiles also provide a framework to explore relationships between topics.
We compare different methods for building profiles and evaluate them in terms of their information content and their ability to predict relationships between topics. We contribute new methods for term weighting and for identifying relevant text segments in web documents.
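As a minimal illustration of the profile framework (not the thesis's exact feature set or weighting scheme), a profile can be sketched as a weighted bag of terms extracted from a topic's documents, with relationships scored by profile similarity; the topics and documents below are invented.

```python
# Sketch: term-frequency profiles per topic, compared by cosine similarity.
from collections import Counter
from math import sqrt

def profile(docs):
    """Build a simple term-frequency profile from a topic's documents."""
    terms = Counter()
    for d in docs:
        terms.update(d.lower().split())
    return terms

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

senator_a = profile(["voted for the energy bill", "energy policy speech"])
senator_b = profile(["energy bill amendment", "speech on energy independence"])
senator_c = profile(["healthcare reform hearing", "hospital funding debate"])

# Higher profile similarity suggests a stronger topical relationship.
print(cosine(senator_a, senator_b))   # relatively high
print(cosine(senator_a, senator_c))   # relatively low
```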
In this thesis, we present an application of our profile-based approach to explore social networks of US senators generated from web data, and we compare them with networks generated from voting data. We consider both general networks and issue-specific networks. We also apply topic profiles to identify and rank experts given topics of interest, as part of the 2007 TREC Expert Search task.
Overall, our results show that topic profiles provide a strong foundation for exploring different topics and for mining relationships between topics using web data. Our approach can be applied to a wide range of web knowledge discovery problems, in contrast to existing approaches that are mostly designed for specific problems.
27.
Improved Cross-language Information Retrieval via Disambiguation and Vocabulary Discovery. Zhang, Ying (ying.yzhang@gmail.com). January 2007.
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR, where the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms.

We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequently poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Because our OOV translation technique does not rely on prior segmentation, it is free from segmentation error; it leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy.

Good-quality translation resources, especially bilingual dictionaries, are valuable for effective CLIR. We developed a system to facilitate the construction of a large-scale translation lexicon of Chinese-English OOV terms using the web; experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information, so we also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, automatically collecting high-quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provide new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval, but also have wider applications beyond CLIR.
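As a crude illustration of the OOV harvesting idea (the thesis's method is considerably more sophisticated, in particular in locating the Chinese term's boundary without segmentation), translation pairs can be pattern-matched from web text in which authors gloss a Chinese term with its English form; the regex and snippet text below are my own invented stand-ins.

```python
# Sketch: harvest Chinese-English pairs of the form "中文 (English gloss)".
import re

web_text = "关于万维网 (World Wide Web) 的研究; 数据挖掘 (data mining) 综述"

# A run of CJK characters immediately followed by a parenthesised Latin phrase.
pair = re.compile(r"([\u4e00-\u9fff]+)\s*[（(]\s*([A-Za-z][A-Za-z \-]*)\s*[）)]")

for chinese, english in pair.findall(web_text):
    print(chinese, "->", english)
# 万维网 -> World Wide Web
# 数据挖掘 -> data mining
```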
28.
A Web-based Question Answering System. Zhang, Dell; Lee, Wee Sun. 01 1900.
The Web is apparently an ideal source of answers to a large variety of questions, due to the tremendous amount of information available online. This paper describes LAMP, a publicly accessible Web-based question answering system. A particular characteristic of this system is that it takes advantage only of the snippets in the search results returned by a search engine such as Google. We believe this “snippet-tolerant” property is important for an online question answering system to be practical, because downloading and analyzing the original web documents is time-consuming. The performance of LAMP is comparable to the best state-of-the-art question answering systems. / Singapore-MIT Alliance (SMA)
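A toy sketch of the snippet-only strategy follows: candidate answers are tallied directly from search-result snippets rather than from downloaded documents. The snippets are hard-coded stand-ins for search-engine output, and the crude capitalised-bigram heuristic stands in for LAMP's actual answer typing and ranking.

```python
# Sketch: vote for answer candidates over search snippets; data is invented.
import re
from collections import Counter

question = "Who invented the World Wide Web?"
snippets = [
    "The World Wide Web was invented by Tim Berners-Lee in 1989 ...",
    "Tim Berners-Lee, inventor of the Web, ...",
    "... credited to Tim Berners-Lee while at CERN.",
]

# Tally runs of capitalised words as crude person-name candidates.
candidates = Counter()
for s in snippets:
    candidates.update(re.findall(r"\b[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+", s))

print(candidates.most_common(1))   # [('Tim Berners-Lee', 3)]
```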
29.
Domain ontology learning from the web. Sánchez Ruenes, David. 14 December 2007.
Ontology learning is defined as the set of methods used for building an ontology from scratch, or enriching or adapting an existing one, in a semi-automatic fashion using heterogeneous information sources. This data-driven procedure uses text, electronic dictionaries, linguistic ontologies, and structured and semi-structured information to acquire knowledge. Recently, with the enormous growth of the Information Society, the Web has become a valuable source of information for almost every possible domain of knowledge, motivating researchers to consider it a valid repository for information retrieval and knowledge acquisition. However, the Web suffers from problems that are not typically observed in classical information repositories: human-oriented presentation, noise, untrusted sources, high dynamicity, and overwhelming size. Even so, it also presents characteristics that are attractive for knowledge acquisition: due to its huge size and heterogeneity, the Web is assumed to approximate the real global distribution of information. The present work introduces a novel approach to ontology learning, with new methods for knowledge acquisition from the Web. What distinguishes this proposal from previous work is the particular adaptation of several well-known learning techniques to the web corpus and the exploitation of characteristics of the Web environment to compose an automatic, unsupervised, and domain-independent approach. With respect to the ontology building process, the following methods have been developed: i) extraction and selection of domain-related terms, organised taxonomically; ii) discovery and labelling of non-taxonomic relationships between concepts; iii) additional methods for improving the final structure, including the detection of named entities, class attributes, multiple inheritance, and a certain degree of semantic disambiguation. The full learning methodology has been implemented as a distributed agent-based system, providing a scalable solution. It has been evaluated on several well-differentiated domains of knowledge, obtaining good-quality results. Finally, several direct applications have been developed, including the automatic structuring of digital libraries and web resources, and ontology-based Web information retrieval.
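As one concrete illustration of taxonomic extraction from web text (a classic building block of this kind of pipeline, not the thesis's full method), Hearst-style patterns such as "X such as Y" can propose is-a candidates; the sentences below stand in for retrieved web content, and a real pipeline would add web-scale statistical filtering of the candidates.

```python
# Sketch: Hearst-pattern hyponym extraction over (invented) web sentences.
import re

sentences = [
    "infectious diseases such as malaria, cholera and influenza",
    "diseases such as diabetes are chronic",
]

pattern = re.compile(r"(\w+) such as ((?:\w+(?:, | and )?)+)")

for sentence in sentences:
    for match in pattern.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in re.split(r", | and ", match.group(2)):
            print(f"{hyponym.strip()}  is-a  {hypernym}")
# malaria  is-a  diseases
# cholera  is-a  diseases
# influenza  is-a  diseases
# diabetes  is-a  diseases
```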
30.
A Meaningful Candidate Approach to Mining Bi-Directional Traversal Patterns on the WWW. Chen, Jiun-rung. 27 July 2004.
Since the World Wide Web (WWW) appeared, more and more useful information has become available on it. To find this information, one application of data mining techniques to the Web, referred to as Web mining, has become a research area of increasing importance. Mining traversal patterns is one of the important topics in Web mining; it focuses on finding the page sequences that users browse frequently. Although algorithms for mining association rules (e.g., the Apriori and DHP algorithms) could be applied to mine traversal patterns, they do not exploit the properties of Web transactions and generate too many invalid candidate patterns, so they cannot provide good performance. Wu et al. proposed the SpeedTracer algorithm for mining traversal patterns, which exploits a key property of Web transactions: traversal patterns are continuous paths in the Web structure. Although SpeedTracer reduces the number of candidate patterns generated in the mining process, it does not efficiently use this property to reduce the number of checks performed on the subsets of each candidate pattern.

In this thesis, we design three algorithms that improve on the SpeedTracer algorithm for mining traversal patterns. The first, SpeedTracer*-I, uses the continuity property of Web transactions to generate and count all candidate patterns directly from user sessions, and also uses this property to improve the subset-checking step when candidate patterns are generated. Building on SpeedTracer*-I, we then propose the SpeedTracer*-II and SpeedTracer*-III algorithms, which improve performance by reducing the number of database scans. In SpeedTracer*-II, given a parameter n, we first apply SpeedTracer*-I to find L_n, the frequent patterns of length n, and use L_n to generate all candidate sets C_k for k > n; after generating all candidate patterns, we scan the database once to count them, and the frequent patterns can then be determined. In SpeedTracer*-III, given a parameter n, we likewise apply SpeedTracer*-I to find L_n first, and then generate and count C_k for k > n directly from user sessions based on L_n.

The simulation results show that SpeedTracer*-I outperforms the SpeedTracer algorithm in terms of processing time. They also show that SpeedTracer*-II and SpeedTracer*-III outperform both SpeedTracer and SpeedTracer*-I, because the former two algorithms require fewer database scans. Moreover, our simulation results show that all of the proposed algorithms provide better performance than Apriori-like algorithms (e.g., the FS and FDLP algorithms) in terms of processing time.
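A minimal sketch of the continuity property these algorithms exploit: because a traversal pattern must be a consecutive run of pages in a browsing session, candidate k-patterns can be enumerated directly as the length-k windows of user sessions rather than by Apriori-style joins. The sessions and support threshold below are invented for illustration.

```python
# Sketch: count only contiguous page subsequences, the key SpeedTracer idea.
from collections import Counter

sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["A", "C", "D"],
]
min_support = 2

def frequent_contiguous(sessions, k):
    """Count length-k contiguous subsequences across all sessions."""
    counts = Counter()
    for s in sessions:
        # Each distinct window is counted at most once per session.
        for window in {tuple(s[i:i + k]) for i in range(len(s) - k + 1)}:
            counts[window] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

for k in (2, 3):
    print(k, frequent_contiguous(sessions, k))
# Contiguous patterns such as ('B', 'C') and ('C', 'D') survive, while
# non-contiguous pairs such as ('A', 'D') are never even generated,
# which is exactly why far fewer invalid candidates are checked.
```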