81 |
Získávání znalostí z webových logů / Knowledge Discovery from Web Logs. Vlk, Vladimír. January 2013 (has links)
This master's thesis deals with the creation of an application whose goal is to preprocess web logs and find association rules in them. The first part deals with the concept of Web mining. The second part is devoted to Web usage mining and the notions related to it. The third part deals with the design of the application, and the fourth describes its implementation. The last section deals with experiments performed with the application and the interpretation of their results.
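The two stages the abstract describes, preprocessing web logs into per-user transactions and then mining association rules from them, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the session contents, support/confidence thresholds, and function names are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical preprocessed sessions: each user's set of visited pages.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/cart", "/checkout"},
    {"/home", "/products", "/cart", "/checkout"},
]

def mine_pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Mine A -> B rules between single pages (a minimal Apriori-style pass)."""
    n = len(sessions)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for s in sessions:
        for item in s:
            item_count[item] += 1
        for a, b in combinations(sorted(s), 2):
            pair_count[(a, b)] += 1
    rules = []
    for (a, b), c in pair_count.items():
        if c / n < min_support:
            continue  # itemset {a, b} is not frequent
        for ante, cons in ((a, b), (b, a)):
            conf = c / item_count[ante]
            if conf >= min_confidence:
                rules.append((ante, cons, c / n, conf))
    return rules

rules = mine_pair_rules(sessions)
for ante, cons, sup, conf in rules:
    print(f"{ante} -> {cons} (support={sup:.2f}, confidence={conf:.2f})")
```

A full implementation would also mine larger itemsets; restricting the sketch to page pairs keeps the Apriori idea visible in a few lines.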
|
82 |
Získávání znalostí na webu - shlukování / Web Mining - Clustering. Rychnovský, Martin. January 2008 (has links)
This work presents the topic of data mining on the Web, with a focus on clustering. The aim of the project was to study the field of clustering and to implement clustering using the k-means algorithm. The algorithm was then tested on a dataset of text documents and on data extracted from the Web. The clustering method was implemented using Java technologies.
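The k-means procedure the project implements (in Java) can be sketched in a few lines; the sketch below uses Python for consistency with the other examples here, and the toy 2-D "document vectors" are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means: assign points to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared distance to p
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Toy 2-D "document vectors" forming two obvious groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

For real text documents, `points` would be TF-IDF vectors rather than 2-D coordinates; the loop itself is unchanged.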
|
83 |
EXPLORING HEALTH WEBSITE USERS BY WEB MINING. Kong, Wei. 07 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / With the continuous growth of health information on the Internet, providing user-oriented health services online has become a great challenge to health providers. Understanding the information needs of users is the first step toward providing tailored health services. The purpose of this study is to examine the navigation behavior of different user groups by extracting their search terms, and to make suggestions for restructuring a website to offer more customized Web service. The study analyzed five months of daily access weblog files from one local health provider's website, discovered the most popular general and health-related topics, and compared the information search strategies of the patient/consumer and doctor groups. Our findings show that users do not search for health information as much as was thought. The top two health topics patients are concerned about are children's health and occupational health. Another topic that both user groups are interested in is medical records. Patients and doctors also have different search strategies when looking for information on this website: patients go back to the previous page more often, while doctors usually go to the final page directly and then leave without returning. As a result, suggestions to redesign and improve the website are discussed; a more intuitive portal and more customized links for both user groups are proposed.
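Extracting search terms per user group from raw access-log lines, the first step of such a study, might look like the sketch below. The log format, the `q` query-parameter name, and the group marker in the URL path are all assumptions, not details from the thesis.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Hypothetical Apache-style access log lines; the query-parameter name ("q")
# and the user-group marker in the path are illustrative assumptions.
log_lines = [
    '10.0.0.1 - - [01/Jan/2010:10:00:00] "GET /patient/search?q=children+health HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2010:10:01:00] "GET /doctor/search?q=medical+records HTTP/1.1" 200 734',
    '10.0.0.3 - - [01/Jan/2010:10:02:00] "GET /patient/search?q=children+health HTTP/1.1" 200 512',
]

request_re = re.compile(r'"GET (\S+) HTTP/[\d.]+"')

def search_terms_by_group(lines):
    """Tally search terms separately for each user group inferred from the path."""
    terms = {"patient": Counter(), "doctor": Counter()}
    for line in lines:
        m = request_re.search(line)
        if not m:
            continue  # skip lines that are not GET requests
        url = urlparse(m.group(1))
        group = url.path.strip("/").split("/")[0]
        for q in parse_qs(url.query).get("q", []):  # parse_qs decodes '+' as space
            if group in terms:
                terms[group][q] += 1
    return terms

terms = search_terms_by_group(log_lines)
```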
|
84 |
Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns. The Development and Evaluation of New Web Mining Methods that enhance Information Retrieval and improve the Understanding of User's Web Behavior in Websites and Social Blogs. Ammari, Ahmad N. January 2010 (has links)
The rapid growth of the World Wide Web over the last decade has made it the largest publicly accessible data source in the world and one of the most significant and influential information revolutions of modern times. The Web has impacted almost every aspect of human life and activity, causing paradigm shifts and transformational changes in business, governance, and education. Moreover, the rapid evolution of Web 2.0 and the Social Web in the past few years, including social blogs and friendship networking sites, has dramatically transformed the Web from a raw environment for information consumption into a dynamic and rich platform for worldwide information production and sharing. However, this growth has also produced an uncontrollable explosion and abundance of textual content, making it very difficult for users to find and retrieve the relevant information they truly seek, or even to locate a relevant page within a single website easily and efficiently. This creates challenges both for researchers, who must develop new mining techniques to improve the user experience on the Web, and for organizations, which need to understand the true informational interests and needs of their customers in order to provide the products, services, and information that match the requirements of every online customer.
With these challenges in mind, Web mining aims to extract hidden patterns and discover useful knowledge from Web page contents, Web hyperlinks, and Web usage logs. Based on the primary kind of Web data used in the mining process, Web mining tasks fall into three main types: Web content mining, which extracts knowledge from Web page contents using text mining techniques; Web structure mining, which extracts patterns from the hyperlinks that form the structure of the website; and Web usage mining, which mines users' navigational patterns from Web server logs that record the page accesses made by every user, representing the interactions between users and the pages of a website. The main goal of this thesis is to help address the challenges that have resulted from the information explosion and overload on the Web by proposing and developing novel Web mining-based approaches. Toward this goal, the thesis presents, analyzes, and evaluates three major contributions: first, an integrated Web structure and usage mining approach that recommends a collection of hyperlinks to be placed on the homepage of a website for its visitors; second, an integrated Web content and usage mining approach to improve the understanding of users' Web behavior and discover user group interests in a website; and third, a supervised classification model based on recent Social Web concepts, such as tag clouds, to improve the retrieval of relevant articles and posts from social blogs.
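A basic building block of the Web usage mining described above, turning raw per-user page requests from a server log into sessions, can be sketched with a simple idle-timeout heuristic. The 30-minute timeout and the input tuple format are common conventions, not the thesis's specific method.

```python
SESSION_TIMEOUT = 30 * 60  # 30 minutes of inactivity, a common heuristic

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split each user's (timestamp, page) requests into sessions by idle gap."""
    by_user = {}
    for user, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user.setdefault(user, []).append((ts, page))
    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

# Toy request tuples: (user id, unix timestamp, page).
requests = [
    ("u1", 0, "/home"), ("u1", 60, "/products"),
    ("u1", 60 * 60, "/home"),          # 59-minute gap -> new session
    ("u2", 10, "/about"),
]
sessions = sessionize(requests)
```

The resulting sessions are exactly the "transactions" that pattern-mining steps (association rules, clustering, traversal patterns) then consume.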
|
85 |
Digital Intelligence – Möglichkeiten und Umsetzung einer informatikgestützten Frühaufklärung / Digital Intelligence – opportunities and implementation of a data-driven foresight. Walde, Peter. 18 January 2011 (links) (PDF)
The goal of Digital Intelligence, that is, data-driven strategic foresight, is to support the shaping of the future on the basis of valid and well-founded digital information, with comparatively little effort and enormous savings in time and cost. Innovative technologies for (semi-)automatic language and data processing help here, for example information retrieval, (temporal) data, text and web mining, information visualization, conceptual structures, and informetrics. They make it possible to detect key topics and latent relationships in an unmanageably large, distributed, and inhomogeneous volume of data such as patents, scientific publications, press documents, or web content in good time, and to deliver them quickly and in a targeted manner. Digital Intelligence thus makes intuitively sensed patterns and developments explicit and measurable.
This research aims, first, to show what computer science can contribute to data-driven foresight and, second, to put these possibilities into practice in a pragmatic context.
Its starting point is an introduction to the discipline of strategic foresight and its data-driven branch, Digital Intelligence.
The theoretical and, in particular, computer-science foundations of foresight are discussed and classified, above all the possibilities of time-oriented data exploration.
Various methods and software tools are designed and developed that support the time-oriented exploration of unstructured text data in particular (temporal text mining). Only approaches that can be used pragmatically in the context of a large institution and under the specific requirements of strategic foresight are considered. Noteworthy among them are a platform for collective search and an innovative method for identifying weak signals.
Finally, a Digital Intelligence service is presented and discussed that was successfully implemented on this basis in a global technology-oriented corporation and that enables systematic competitor, market, and technology analysis based on people's digital traces.
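The abstract mentions temporal text mining and a method for identifying weak signals. As a rough illustration of the general idea (not the dissertation's actual method), one can flag terms that are still a small share of the corpus overall but absent in the earliest time slice and clearly present in the latest one:

```python
from collections import Counter

# Hypothetical yearly document collections (already tokenized).
slices = {
    2008: ["fuel cell solar grid solar".split(), "grid storage".split()],
    2009: ["solar grid graphene".split(), "graphene battery solar".split()],
    2010: ["graphene graphene battery solar grid".split()],
}

def weak_signals(slices, max_share=0.25, min_recent=2):
    """Flag terms that emerged recently, are still rare overall,
    and reach at least min_recent mentions in the latest slice."""
    years = sorted(slices)
    counts = {y: Counter(t for doc in slices[y] for t in doc) for y in years}
    total = sum(sum(c.values()) for c in counts.values())
    signals = []
    for term in set().union(*counts.values()):
        freqs = [counts[y][term] for y in years]
        share = sum(freqs) / total
        if freqs[0] == 0 and share <= max_share and freqs[-1] >= min_recent:
            signals.append(term)
    return sorted(signals)
```

Here "graphene" would be flagged: it is absent in 2008 but growing afterwards. Real weak-signal detection would work on far larger corpora and more robust trend statistics.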
|
86 |
Algoritmo rastreador web especialista nuclear / Nuclear expert web crawler algorithm. Reis, Thiago. 12 November 2013 (has links)
Over the last years the Web has grown exponentially, becoming the largest information repository ever created and a new and valuable source of potentially useful information for many fields, including the nuclear domain. However, due to the Web's characteristics and, mainly, its huge data volume, finding and retrieving relevant and useful information are non-trivial tasks. This challenge is addressed by web search and retrieval algorithms called web crawlers. This work presents the research and development of a crawler algorithm able to search and retrieve webpages with nuclear-related textual content in an autonomous and massive fashion. The algorithm was designed under the expert-system model: it has a knowledge base containing nuclear topics and the keywords that define them, and an inference engine composed of a multi-layer perceptron artificial neural network that estimates the relevance of webpages to a given knowledge-base nuclear topic during the search. The algorithm is thus able to autonomously search the Web by following the hyperlinks that interconnect webpages and to retrieve those most relevant to a selected nuclear topic, emulating the ability of a nuclear expert to browse the Web and evaluate nuclear information. Preliminary experimental results show a retrieval precision of 80% for the general nuclear domain topic and 72% for the nuclear power topic, indicating that the proposed algorithm is effective and efficient at searching the Web and retrieving nuclear-related information.
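The crawl loop described above can be sketched as follows. To keep the sketch self-contained, a keyword-overlap score stands in for the thesis's multi-layer perceptron relevance estimator, and an in-memory dictionary stands in for real HTTP fetching; the pages, keywords, and threshold are invented for illustration.

```python
from collections import deque

# Toy in-memory "web": url -> (text, outgoing links). A real crawler would
# fetch pages over HTTP; this stub keeps the sketch runnable on its own.
WEB = {
    "a": ("nuclear reactor safety and uranium fuel", ["b", "c"]),
    "b": ("football scores and match results", ["d"]),
    "c": ("nuclear power plant reactor design", ["d"]),
    "d": ("uranium enrichment process overview", []),
}

KNOWLEDGE_BASE = {"nuclear", "reactor", "uranium", "enrichment"}

def relevance(text, keywords=KNOWLEDGE_BASE):
    """Keyword-overlap score standing in for the thesis's neural estimator."""
    words = set(text.split())
    return len(words & keywords) / len(keywords)

def focused_crawl(seed, threshold=0.25):
    """Breadth-first crawl that only keeps and expands relevant pages."""
    seen, relevant = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        if relevance(text) >= threshold:
            relevant.append(url)          # keep the page
            for nxt in links:             # and follow its outgoing links
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return relevant

pages = focused_crawl("a")
```

Note that page "d" is still reached via the relevant page "c" even though the irrelevant page "b" is not expanded; pruning the frontier at irrelevant pages is what makes the crawl "focused".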
|
87 |
考慮網站結構之使用者網站漫遊行為的研究 / Efficient Mining of Web Traversal Walks with Site Topology. 李華富, Lee, Hua-Fu. Unknown Date (has links)
With the development of the World Wide Web, websites attract large numbers of users. Analyzing the browsing behavior that most users share not only helps in designing and updating the site structure, but also enables effective personalized services for users with the same browsing behavior.
Existing research on user browsing behavior mostly mines path traversal patterns or page sequential patterns. We therefore propose a new kind of pattern, the Web traversal walk, together with two algorithms, AM and PM, for mining frequent user traversal walks.
AM is designed for the case where the data volume is too large to fit entirely in main memory; it follows the spirit of the Apriori algorithm to mine frequent user traversal walks. PM targets the case where the transformed data fits in main memory: it builds a tree structure in memory to compress the original database and mines all frequent traversal walks step by step from that tree. Under the experimental assumptions, both AM and PM exhibit linear runtime and scalability. / With progressive expansion in the size and complexity of web sites on the World Wide Web, much research has been done on the discovery of useful and interesting Web traversal patterns.
Most existing approaches focus on mining path traversal patterns or sequential patterns. In this paper, we present a new pattern, the Web traversal walk, for Web traversal pattern mining. A Web traversal walk is the complete trail of a user's traversal behavior in a single Web site, and mining such walks is more helpful for understanding and predicting Web site access patterns.
Two efficient algorithms, AM and PM, are proposed to discover Web traversal walks. PM is used when the database fits in main memory, and AM when it does not. AM builds on the Apriori property to discover all frequent Web traversal walks from Web logs. PM constructs a tree structure in memory from the Web logs and generates the frequent Web traversal walks from that tree. Experimental results show that both methods perform well in efficiency and scalability.
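The Apriori-property idea behind AM can be illustrated on contiguous sub-walks: a sub-walk can only be frequent if its two (length-1) sub-walks are. The sketch below is a simplified level-wise miner over toy walks, not the AM or PM algorithm itself.

```python
from collections import Counter

# Toy traversal walks (page sequences reconstructed from a site's logs).
walks = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["A", "C"],
]

def frequent_subwalks(walks, min_support=2):
    """Level-wise (Apriori-style) mining of frequent contiguous sub-walks."""
    length, frequent, result = 1, None, {}
    while True:
        counts = Counter()
        for w in walks:
            # distinct sub-walks of this length occurring in the walk
            subs = {tuple(w[i:i + length]) for i in range(len(w) - length + 1)}
            for sub in subs:
                # Apriori pruning: both (length-1) prefix and suffix must be frequent
                if length == 1 or (sub[:-1] in frequent and sub[1:] in frequent):
                    counts[sub] += 1
        frequent = {s for s, c in counts.items() if c >= min_support}
        if not frequent:
            return result
        result.update({s: counts[s] for s in frequent})
        length += 1
```

The same candidate-prune-count loop, applied to full site walks and with disk-resident data, is the essence of the Apriori family that AM belongs to.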
|
89 |
Web mining for social network analysis. Elhaddad, Mohamed Kamel Abdelsalam. 09 August 2021 (has links)
Undoubtedly, the rapid development of information systems and the widespread use of electronic means and social networks have played a significant role in accelerating the pace of events worldwide, as seen in the 2012 Gaza conflict (the 8-day war), the pro-secessionist rebellion in Eastern Ukraine in 2013-2014, the 2016 US Presidential elections, and the COVID-19 pandemic since the beginning of 2020. As the amount of data shared daily on various social networking platforms, in different languages, grows quickly, techniques are needed to classify this huge amount of data automatically, promptly, and correctly.
Among the many social networking platforms, Twitter is one of the most used by netizens. It allows its users to communicate, share their opinions, and express their emotions (sentiments) in the form of short blogs, easily and at no cost. Moreover, unlike other social networking platforms, Twitter allows research institutions to access its public and historical data, upon request and under control. Therefore, many organizations at different levels (e.g., governmental, commercial) seek to benefit from the analysis and classification of shared tweets in many application domains, for example sentiment analysis, which determines a user's polarity from the content of their shared text, and misleading-information detection, which assesses the legitimacy and credibility of shared information. To attain this objective, one can apply numerous data representation, preprocessing, and natural language processing techniques, as well as machine/deep learning algorithms. There are several challenges and limitations in existing approaches, including the management of tweets in multiple languages, the determination of what features the feature vector should include, and the assignment of representative and descriptive weights to these features for different mining tasks. Besides, existing performance evaluation metrics have limitations that prevent a full assessment of the developed classification systems.
In this dissertation, two novel frameworks are introduced: the first efficiently analyzes and classifies bilingual (Arabic and English) textual content of social networks, while the second evaluates the performance of binary classification algorithms. The first framework provides: (1) an approach to handle tweets written in Arabic and English, extensible to more languages and other social networking platforms; (2) effective data preparation and preprocessing techniques; (3) a novel feature selection technique that utilizes different types of features (content-dependent, context-dependent, and domain-dependent); and (4) a novel feature extraction technique that assigns weights to the linguistic features based on how representative they are of the classes they belong to. The proposed framework is employed for sentiment analysis and misleading-information detection. Its performance is compared to state-of-the-art classification approaches on 11 benchmark datasets comprising both Arabic and English textual content, demonstrating considerable improvement on all performance evaluation metrics. The framework is then applied in a real-life case study to detect misleading information surrounding the spread of COVID-19.
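The dissertation's exact weighting scheme is not spelled out in this abstract; a common way to weight a feature by how representative it is of a class is the ratio P(term | class) / P(term), sketched below on toy labeled tweets. All data and names are illustrative assumptions.

```python
from collections import Counter

# Toy labeled documents (already tokenized tweets).
docs = [
    (["vaccine", "cure", "miracle"], "misleading"),
    (["miracle", "cure", "garlic"], "misleading"),
    (["vaccine", "trial", "results"], "credible"),
    (["trial", "peer", "review"], "credible"),
]

def class_weights(docs):
    """Weight each term by how concentrated it is in each class:
    P(term | class) / P(term), a simple representativeness ratio."""
    class_counts, total_counts = {}, Counter()
    class_sizes, total = Counter(), 0
    for tokens, label in docs:
        cc = class_counts.setdefault(label, Counter())
        for t in tokens:
            cc[t] += 1
            total_counts[t] += 1
        class_sizes[label] += len(tokens)
        total += len(tokens)
    weights = {}
    for label, cc in class_counts.items():
        weights[label] = {
            t: (cc[t] / class_sizes[label]) / (total_counts[t] / total)
            for t in cc
        }
    return weights

w = class_weights(docs)
```

A term like "cure" that appears only in one class gets a weight above 1 for that class, while a term spread evenly across classes (like "vaccine" here) gets a weight of 1, i.e. no discriminative value.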
In the second framework, a new multidimensional classification assessment score (MCAS) is introduced. MCAS measures how good a classification algorithm is at binary classification problems. It takes into consideration the effect of misclassification errors on the probability of correctly detecting instances from both classes, and it remains valid regardless of the size of the dataset and whether the distribution of instances over the classes is balanced or unbalanced. An empirical and practical analysis is conducted on both synthetic and real-life datasets to compare the behavior of the proposed metric against commonly used ones. The analysis reveals that the new measure can distinguish the performance of different classification techniques. Furthermore, it allows a class-based assessment of classification algorithms, evaluating how well an algorithm handles data from each class separately. This is useful when classifying instances from one class is more important than classifying instances from the other, as in COVID-19 testing, where detecting positive patients is much more important than detecting negative ones. / Graduate
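The MCAS formula itself is not given in this abstract, so the sketch below illustrates only the underlying motivation: on unbalanced data, overall accuracy can look good while one class is badly misclassified, which per-class recall and its geometric mean (a known imbalance-robust score, not MCAS) expose.

```python
def per_class_recall(y_true, y_pred):
    """Recall computed separately for each class label."""
    recalls = {}
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return recalls

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; low if either class fails."""
    r = list(per_class_recall(y_true, y_pred).values())
    prod = 1.0
    for v in r:
        prod *= v
    return prod ** (1 / len(r))

# Heavily unbalanced toy data: accuracy looks fine, G-mean does not.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]   # catches only 1 of 5 positives
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here accuracy is 0.96 even though only 20% of the positive class is detected; a class-aware score like the one MCAS proposes drops sharply in the same situation.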
|
90 |
Metody dolování relevantních dat z prostředí webu s využitím sociálních sítí / Data Mining of Relevant Information from WWW Using Social Networks. Smolík, Jakub. January 2013 (has links)
This thesis focuses on solving problems related to searching for relevant data on the internet. The text presents a possible solution in the form of an application capable of automated extraction and aggregation of data from the web and its presentation, based on input keywords. For this purpose, the possibilities of automated extraction from three chosen data types, mainly used for data storage on the internet, were studied and described. Furthermore, the thesis focuses on ways of mining data from social networks. Finally, it presents the planning, implementation, and testing of the created application, which can easily find, display, and give the user easy access to the searched information.
|