61 |
Creating a Criterion-Based Information Agent Through Data Mining for Automated Identification of Scholarly Research on the World Wide Web / Nicholson, Scott 05 1900 (has links)
This dissertation creates an information agent that correctly identifies Web pages containing scholarly research approximately 96% of the time. It does this by analyzing the Web page against a set of criteria and then using a classification tree to arrive at a decision.
The criteria were gathered from the literature on selecting print and electronic materials for academic libraries. A Delphi study was done with an international panel of librarians to expand and refine the criteria until a list of 41 operationalizable criteria was agreed upon. A Perl program was then designed to analyze a Web page and determine a numerical value for each criterion.
A large collection of Web pages was gathered comprising 5,000 pages that contain the full work of scholarly research and 5,000 random pages, representative of user searches, which do not contain scholarly research. Datasets were built by running the Perl program on these Web pages. The datasets were split into model building and testing sets.
Data mining was then used to create different classification models. Four techniques were used: logistic regression, nonparametric discriminant analysis, classification trees, and neural networks. The models were created with the model datasets and then tested against the test dataset. Precision and recall were used to judge the effectiveness of each model. In addition, a set of pages that were difficult to classify because of their similarity to scholarly research was gathered and classified with the models.
The classification tree created the most effective classification model, with a precision ratio of 96% and a recall ratio of 95.6%. However, logistic regression created a model that was able to correctly classify more of the problematic pages.
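A minimal sketch of this model-building and evaluation step, assuming scikit-learn and a synthetic feature matrix in place of the Perl-generated criterion scores used in the dissertation: each of the 41 columns stands in for one criterion value, and a classification tree and a logistic regression are fit on the model set and compared by precision and recall on the held-out test set (with random data the scores themselves are meaningless; the point is the workflow).

```python
# A sketch, not the dissertation's pipeline: synthetic criterion scores stand in
# for the values the Perl program computed for each of the 41 criteria.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.random((10_000, 41))        # one row per Web page, one column per criterion
y = rng.integers(0, 2, 10_000)      # 1 = contains scholarly research, 0 = does not

# Split into a model-building set and a test set, as in the study.
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "classification tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    clf.fit(X_model, y_model)
    pred = clf.predict(X_test)
    print(f"{name}: precision={precision_score(y_test, pred):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}")
```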
This agent can be used to create a database of scholarly research published on the Web. In addition, the technique can be used to create a database of any type of structured electronic information.
|
62 |
Digital search literacy, self-directed learning and epistemic cognition in a South African undergraduate student sample / Herselman, Taryn Elise January 2016 (has links)
Thesis (M.A (Psychology))--University of the Witwatersrand, Faculty of Humanities, 2016 / Undergraduate students require a certain degree of digital literacy in order to make use of the internet as a resource and educational tool. This report argues that two critical aspects of digital search literacy are the student's ability to effectively execute and monitor the search strategies used to navigate the ever-increasing number of webpages, and the critical thinking skills required to evaluate those documents in an academic context. Therefore, digital literacy requires effective self-directed learning (SDL) skills and appropriate epistemic cognition (EC). The present research used a sequential explanatory design, which comprised two phases: Stage 1 (N = 119) and Stage 2 (N = 17). The sample for both phases of the project was drawn from students enrolled for first-year level psychology courses at the University of the Witwatersrand. The sample for Stage 2 was drawn from students who had already completed Stage 1, which required the completion of an online questionnaire. During the second phase, students were tasked with conducting a web-based search on an essay topic relating to the discipline of psychology. Several research objectives were examined: the general self-reported epistemic cognition and readiness for self-directed learning levels of a sample of undergraduate South African university students; self-reported self-directed learning behaviours, epistemic cognition and digital search literacy issues; the impact of search strategies on the type and quality of information sources located; and the psychology-specific epistemic beliefs involved in the evaluation of source features of web-based documents.
Findings showed that students did indeed engage in specific self-directed learning and epistemic cognition behaviours while searching for resources online. The key components of digital search literacy included self-directed learning (monitoring and strategy use) and epistemic cognition (source evaluation). In terms of rating the sources, personal justification and justification by authority were the most predominant when students rated the most credible sources, while relevance to task, personal justification and format/style were applied more often when rating the least credible web documents. In conclusion, future research on digital literacy should include the relative contribution of SDL and EC components as important mechanisms for online search strategies and critical source evaluation.
Keywords: self-directed learning, epistemic cognition and beliefs, source evaluation, web search, navigation behaviour, strategies / GR2017
|
63 |
Collecting web data for social science research / Li, Fu Min January 2018 (has links)
University of Macau / Faculty of Social Sciences. / Department of Sociology
|
64 |
An Efficient and Incremental System to Mine Contiguous Frequent Sequences / El-Sayed, Maged F 30 January 2004 (has links)
Mining frequent patterns is an important component of many prediction systems. One common usage in web applications is the mining of users' access behavior for the purpose of predicting and hence pre-fetching the web pages that the user is likely to visit. Frequent sequence mining approaches in the literature are often based on the use of an Apriori-like candidate generation strategy, which typically requires numerous scans of the potentially huge sequence database. In this paper we instead introduce a more efficient strategy for discovering frequent patterns in sequence databases that requires only two scans of the database. The first scan obtains support counts for subsequences of length two. The second scan extracts potentially frequent sequences of any length and represents them as a compressed frequent sequences tree structure (FS-tree). Frequent sequence patterns are then mined from the FS-tree. Incremental and interactive mining functionalities are also facilitated by the FS-tree. As part of this work, we developed the FS-Miner, a system that discovers frequent sequences from web log files. The FS-Miner has the ability to adapt to changes in users' behavior over time, in the form of new input sequences, and to respond incrementally without the need to perform full re-computation. Our system also allows the user to change the input parameters (e.g., minimum support and desired pattern size) interactively without requiring full re-computation in most cases. We have tested our system using two different data sets, comparing it against two other algorithms from the literature. Our experimental results show that our system scales up linearly with the size of the input database. Furthermore, it exhibits excellent adaptability to support threshold decreases. We also show that the incremental update capability of the system provides significant performance advantages over full re-computation even for relatively large update sizes.
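A simplified, illustrative sketch of the two-scan idea, not the FS-tree or FS-Miner implementation itself: the first pass over the sessions counts contiguous length-2 subsequences, and the second pass grows longer contiguous candidates only while every adjacent pair inside them met the minimum support in the first pass. The function name, the toy sessions, and the support threshold are invented for illustration.

```python
# Two-scan contiguous frequent sequence mining, heavily simplified.
from collections import Counter

def mine_contiguous(sequences, min_support):
    # Scan 1: support counts for contiguous pairs (each pair counted once per sequence).
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update({tuple(seq[i:i + 2]) for i in range(len(seq) - 1)})
    frequent_pairs = {p for p, c in pair_counts.items() if c >= min_support}

    # Scan 2: count longer contiguous candidates whose internal pairs are all frequent.
    counts = Counter()
    for seq in sequences:
        candidates = set()
        for i in range(len(seq) - 1):
            j = i + 2
            # Extend the candidate one item at a time while the newest pair is frequent.
            while j <= len(seq) and tuple(seq[j - 2:j]) in frequent_pairs:
                candidates.add(tuple(seq[i:j]))
                j += 1
        counts.update(candidates)
    return {s: c for s, c in counts.items() if c >= min_support}

# Example: toy web-log sessions given as lists of page identifiers.
sessions = [["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C", "D"], ["A", "C"]]
print(mine_contiguous(sessions, min_support=2))
```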
|
65 |
Ranking and its applications on web search. / 排序算法及其在網絡搜索中的應用 / Pai xu suan fa ji qi zai wang luo sou suo zhong de ying yong. January 2011 (has links)
Wang, Wei. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (p. 106-122). / Abstracts in English and Chinese.
Table of contents: Abstract (p.ii); Acknowledgement (p.vi)
Chapter 1 Introduction (p.1)
  1.1 Overview (p.1)
  1.2 Thesis Contributions (p.5)
  1.3 Thesis Organization (p.8)
Chapter 2 Background and Literature Review (p.9)
  2.1 Label Ranking in Machine Learning (p.11)
    2.1.1 Label Ranking (p.11)
    2.1.2 Semi-Supervised Learning (p.12)
    2.1.3 The Development of Label Ranking (p.14)
  2.2 Question Retrieval in Community Question Answering (p.16)
    2.2.1 Question Retrieval (p.16)
    2.2.2 Basic Question Retrieval Models (p.18)
    2.2.3 The Development of Question Retrieval Models (p.21)
  2.3 Ranking through CTR by Building Click Models (p.24)
    2.3.1 Click Model's Importance (p.24)
    2.3.2 A Simple Example of Click Model (p.25)
    2.3.3 The Development of Click Models (p.27)
Chapter 3 Semi-Supervised Label Ranking (p.30)
  3.1 Motivation: The Limitations of Supervised Label Ranking (p.30)
  3.2 Label Ranking and Semi-Supervised Learning Framework (p.32)
    3.2.1 Label Ranking and Semi-Supervised Learning Setup (p.32)
    3.2.2 Information Gain Decision Tree for Label Ranking (p.37)
    3.2.3 Instance Based Label Ranking (p.39)
    3.2.4 Mallows Model Decision Tree for Label Ranking (p.40)
  3.3 Experiments (p.40)
    3.3.1 Dataset Description (p.41)
    3.3.2 Experimental Results (p.42)
    3.3.3 Discussion (p.42)
  3.4 Summary (p.44)
Chapter 4 An Application of Label Ranking (p.45)
  4.1 Motivation: The Limitations of Traditional Question Retrieval (p.45)
  4.2 Intention Detection Using Label Ranking (p.47)
    4.2.1 Question Intention Detection (p.48)
    4.2.2 Label Ranking Algorithms (p.50)
    4.2.3 Some Other Learning Algorithms (p.53)
  4.3 Improved Question Retrieval Using Label Ranking (p.54)
    4.3.1 Question Retrieval Models (p.55)
    4.3.2 Improved Question Retrieval Model (p.55)
  4.4 Experimental Setup (p.56)
    4.4.1 Experiment Objective (p.56)
    4.4.2 Experiment Design (p.56)
    4.4.3 DataSet Description (p.57)
    4.4.4 Question Feature (p.59)
  4.5 Experiment Result and Comments (p.60)
    4.5.1 Question Classification (p.60)
    4.5.2 Classification Enhanced Question Retrieval (p.63)
  4.6 Summary (p.69)
Chapter 5 Ranking by CTR in Click Models (p.71)
  5.1 Motivation: The Relational Influence's Importance in Click Models (p.71)
  5.2 Click Models in Sponsored Search (p.75)
    5.2.1 A Brief Review on Click Models (p.76)
  5.3 Collaborating Influence Identification from Data Analysis (p.77)
    5.3.1 Quantity Analysis (p.77)
    5.3.2 Psychology Interpretation (p.82)
    5.3.3 Applications Being Influenced (p.82)
  5.4 Incorporating Collaborating Influence into CCM (p.83)
    5.4.1 Dependency Analysis of CCM (p.83)
    5.4.2 Extended CCM (p.84)
    5.4.3 Algorithms (p.85)
  5.5 Incorporating Collaborating Influence into TCM (p.87)
    5.5.1 TCM (p.87)
    5.5.2 Extended TCM (p.88)
    5.5.3 Algorithms (p.88)
  5.6 Experiment (p.90)
    5.6.1 Dataset Description (p.90)
    5.6.2 Experimental Setup (p.91)
    5.6.3 Evaluation Metrics (p.91)
    5.6.4 Baselines (p.92)
    5.6.5 Performance on RMS (p.92)
    5.6.6 Performance on Click Perplexity (p.93)
    5.6.7 Performance on Log-Likelihood (p.93)
    5.6.8 Significance Discussion (p.98)
    5.6.9 Sensitivity Analysis (p.98)
  5.7 Summary (p.102)
Chapter 6 Conclusion and Future Work (p.103)
  6.1 Conclusion (p.103)
  6.2 Future Work (p.105)
Bibliography (p.106)
|
66 |
A personalised query expansion approach using context / Seher, Indra, University of Western Sydney, College of Health and Science, School of Computing and Mathematics, January 2007 (has links)
Users of the Web usually use search engines to find answers to a variety of questions. Although search engines can rapidly process a large number of Web documents, in many cases the answers they return are not relevant to the user's information need, even though they contain the same keywords as the query. This is because the Web contains information sources created by numerous authors independently, and the authors' vocabularies vary greatly. Furthermore, most words in natural languages have inherent ambiguity. This vocabulary mismatch between user queries and Web sources is often addressed through query expansion. Moreover, user questions are often short, and search results generally improve when the query is longer. Various query expansion methods that add useful question-related terms before processing the question have been proposed and shown to improve retrieval performance. Some of these query expansion methods add contextual information related to the user and the question. Human communication, on the other hand, is quite successful and seems effortless. This is mainly due to the understanding of language and the world knowledge that humans have. Human communication is more successful when there is an implicit understanding of the everyday situations of the other participants. This implicit situational information, or the "context" that humans share, enables them to have a more meaningful interaction with one another. As in human-to-human communication, improving computers' access to context can increase the richness of human-computer communication, giving more useful computational services to users. Based on the above factors, this research proposes a method that makes use of context in order to understand and process user requests. Here, the term "context" means the meanings associated with key query terms and the preferences that have to be decided in order to process the query. As in a natural setting, an automated system could produce different results for different users asking the same question. If the automated system knows the user's preferences related to the question, it can use them when processing the query, producing more relevant and useful results for that user. Hence, this research proposes a new approach to personalised query expansion, in which user queries are expanded with user preferences, so that the expanded queries used for processing vary from user to user. An architecture required for such a Web application to carry out personalised query expansion with contextual information is also proposed in the thesis. The preferences used for the query expansion are therefore user-specific. Users have different sets of preferences depending on the tasks they want to perform. Similar tasks that share the same types of preferences can be grouped into task-based domains. Hence, user preferences will be the same within a domain and will vary across domains. Furthermore, different types of subtasks can be performed within a domain. The set of preferences that could be used for each subtask may vary, and it will be a subset of the domain's set of preferences. Hence, this research proposes an approach to personalised query expansion that adds user-, domain- and task-specific preferences to user queries. The main stages of this expansion are identified and discussed in this thesis.
Each of these stages requires different contextual information, which is represented in the context model. Of the main stages identified in the query expansion process, the first three (domain identification, task identification, and missing parameter identification) are explored in the thesis. As the preferences used for the expansion depend on the query domain, the domain of the query has to be identified first. Hence, a domain identification algorithm that makes use of eight different features is proposed in the thesis to identify the domains of given queries. This domain identification also reduces the ambiguity of query terms: once the query domain is identified, the context, or associated meanings, of the query terms are known, which limits the scope for misinterpreting them. The domain identification algorithm uses a domain ontology, a domain dictionary, and a user profile. The domain ontology consists of objects and their categories, attributes of objects and their categories, relationships among objects, and instances and their categories in the domain. The domain dictionary consists of objects and attributes and is created automatically from the domain ontology. The user profile holds the user's long-term preferences, both domain-specific and general. Once the domain of the query is known, the task specified in the query has to be identified in order to decide the user's preferences. This task identification process is found to be similar in domains with similar activities, so domains are grouped at this stage. These domain groups, and the rules that can be used to determine the tasks within them, are identified and discussed in the thesis. For each subtask in the domain groups, the types of preferences that could be used to expand user queries are identified and applied. An experiment is designed to evaluate the performance of the proposed approach. The first three stages of the query expansion (domain identification, task identification, and missing parameter identification) are implemented and evaluated. Samples of five domains are implemented, and queries are collected in these domains from various users. A wizard is provided for creating new domains, and the system also allows editing of existing domains, domain groups, and the types of preferences in the subtasks of the domain groups. Instances of the attributes are manually identified and added to the system using the interface provided. In each stage of the query expansion, the results of the queries are manually identified and compared with the results produced by the system. The results confirm that the proposed method has a positive impact on query expansion. The experiments, results and evaluation of the proposed query expansion approach are also presented in the thesis. The proposed approach to query expansion could be used by search engines, organisations with a limited set of task domains, and any application that can be improved by personalised query expansion. / Doctor of Philosophy (PhD)
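A hedged sketch of the domain-identification step under strong simplifying assumptions: the example domains, their term dictionaries, the function name, and the plain term-overlap score below are all illustrative stand-ins, whereas the thesis combines eight features with a domain ontology, a domain dictionary, and a user profile.

```python
# Illustrative only: pick the domain whose dictionary overlaps the query most.
# Real dictionaries would be generated from the domain ontology, and the score
# would combine several features rather than raw term overlap.
domain_dictionaries = {
    "travel":  {"flight", "hotel", "ticket", "destination", "airline"},
    "movies":  {"movie", "actor", "director", "showtime", "cinema"},
    "cooking": {"recipe", "ingredient", "bake", "oven", "serving"},
}

def identify_domain(query, dictionaries):
    terms = set(query.lower().split())
    scores = {d: len(terms & vocab) for d, vocab in dictionaries.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None = no domain matched

print(identify_domain("cheap flight ticket to Sydney", domain_dictionaries))  # -> travel
```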
|
67 |
Visibility of e-commerce websites to search engines : a comparison between text-based and graphic-based hyperlinks / Ngindana, Mongezi. January 2006 (has links)
Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, 2006. / Includes bibliographical references (leaves: 77-86). Also available online.
|
68 |
Search engine strategies : a model to improve website visibility for SMME website / Chambers, Rickard. January 2005 (has links)
Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, Cape Town, 2005. / Includes bibliographical references (p. 132-142). Also available online.
|
69 |
Can You Find Me Now?: Re-examining Search Engines’ Capability to Retrieve Finding Aids on the World Wide Web / Peter E. Hymas 15 July 2005 (has links)
Five years have passed since Helen R. Tibbo and Lokman I. Meho conducted their study exploring how well six Web search engines retrieved electronic finding aids based on phrase and word searches of terms taken directly from the finding aids. This study similarly seeks to discover how well the current search engines Google, Yahoo! Search, MSN Search, AOL Search, Excite, and Ask Jeeves retrieve finding aids chosen at random from 25 North American primary source repositories. In March 2005, approximately 27% of repositories listed at the “Repositories of Primary Resources” web site had at least four full finding aids online, a substantial increase from 8% in 2000. This study affirmed that phrase searches yield better retrieval results than word searches. Encouragingly, the retrieval rates for phrase and word searches within electronic finding aids were approximately 20% higher than Tibbo and Meho’s findings, despite the existence of several billion more World Wide Web pages in 2005.
|
70 |
The Study for the Usage and Satisfaction of Internet Information Searchers for Internet Searching Tools / Chuang, Ya-Ping 09 August 2005 (has links)
The Internet has become a very important tool in modern life and a useful channel for people to find information. Although there is a lot of information on the Internet, this does not mean that Internet users can always find what they really need and want. The difficulty for Internet users is not that they cannot find information, but that too much information exists. Therefore, good search tools that can help them find useful information become critical to productivity improvement. The purpose of this research is to investigate how people use different tools on the Internet to find the information they are looking for. Three factors were examined: task characteristics, search tools, and user characteristics.
Experiments were conducted on tasks designed to exhibit different characteristics. Four task types, characterized by different levels of uncertainty and equivocality, and two Internet search tools were used in the experiment. Users were asked to use different tools to solve the assigned tasks, in order to see whether their satisfaction differed under different settings.
The results indicate that the effect of task type is not significant; that is, user satisfaction with search engines is similar under different circumstances. The search tool itself is the key factor affecting the level of satisfaction. Therefore, the most important thing for Internet users seeking useful information is to adopt a proper search tool.
|