91

Classification of Patterns in Streaming Data Using Clustering Signatures

Awodokun, Olugbenga January 2017 (has links)
No description available.
92

Higher-order reasoning with graph data

Leonardo de Abreu Cotta (13170135) 29 July 2022 (has links)
Graphs are the natural framework of many of today’s highest-impact computing applications: from online social networking, to Web search, to product recommendations, to chemistry, to bioinformatics, to knowledge bases, to mobile ad-hoc networking. To develop successful applications in these domains, we often need representation learning methods: models mapping nodes, edges, subgraphs, or entire graphs to some meaningful vector space. Such models are studied in the machine learning subfield of graph representation learning (GRL). Previous GRL research has focused on learning node or entire-graph representations through associational tasks. In this work I study higher-order (k>1-node) representations of graphs in the context of both associational and counterfactual tasks.
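As a toy illustration of what a higher-order (here k = 3 node) graph quantity looks like, the sketch below counts triangles directly from an adjacency matrix; the example graph is hypothetical and the hand-computed statistic is only a stand-in for the learned subgraph representations studied in the thesis.

```python
import numpy as np

# Adjacency matrix of a small, hypothetical undirected graph
# (a 4-cycle with one chord): edges 0-1, 0-2, 0-3, 1-2, 2-3.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
])

# Triangle count = trace(A^3) / 6: each closed 3-walk is counted once
# per starting node and once per direction.
triangles = int(np.trace(np.linalg.matrix_power(A, 3)) // 6)
print(triangles)  # the two triangles are {0, 1, 2} and {0, 2, 3}
```

Counts of small substructures like this are the simplest example of information about a graph that no single-node (k = 1) representation captures.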
93

Evaluation of Unsupervised Anomaly Detection in Structured API Logs : A Comparative Evaluation with a Focus on API Endpoints

Hult, Gabriel January 2024 (has links)
With large quantities of API logs being stored, it becomes difficult to manually inspect them and determine whether the requests are benign or anomalous, indicating incorrect access to an application or perhaps actions with malicious intent. Today, companies can rely on third-party penetration testers who occasionally attempt various techniques to find vulnerabilities in software applications. However, to be a self-sustaining company, implementing a system capable of detecting abnormal, potentially malicious traffic would be beneficial. By doing so, attacks can be proactively prevented, mitigating risks faster than waiting for third parties to detect the issues. A potential solution is applying machine learning, specifically anomaly detection, which detects patterns that do not conform to normal standards. This thesis covers the process of using structured log data to find anomalies in API logs. Various unsupervised anomaly detection models were evaluated on their capability to detect anomalies in API logs: K-means, Gaussian Mixture Model, Isolation Forest, and One-Class Support Vector Machine. The findings from the evaluation show that the Gaussian Mixture Model was the best baseline model, reaching a precision of 63% and a recall of 72%, resulting in an F1-score of 0.67, an AUC score of 0.76, and an accuracy of 0.71. After tuning, Isolation Forest performed best, with a precision of 67% and a recall of 80%, resulting in an F1-score of 0.73, an AUC score of 0.83, and an accuracy of 0.75. The pros and cons of each model are presented and discussed, along with insights related to anomaly detection and its applicability to API log analysis and API security.
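A minimal sketch of the unsupervised anomaly-scoring idea evaluated in this work, using none of the thesis's four models but a simple per-feature Gaussian baseline: fit mean and spread on benign API-log features and flag requests far from the benign profile. The feature names, synthetic data, and threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "API log" features: response time (ms) and payload size (KB).
benign = rng.normal(loc=[100.0, 5.0], scale=[10.0, 1.0], size=(500, 2))
attack = np.array([[400.0, 50.0]])  # an obviously abnormal request

mu, sigma = benign.mean(axis=0), benign.std(axis=0)

def anomaly_score(x):
    # Largest absolute z-score across features: a crude density stand-in.
    return np.max(np.abs((x - mu) / sigma), axis=1)

threshold = 4.0  # flag requests more than 4 sigmas from the benign profile
scores = anomaly_score(np.vstack([benign[:3], attack]))
print(scores > threshold)
```

Models like Gaussian Mixture or Isolation Forest replace this single-Gaussian score with one that handles multimodal and correlated feature distributions, which is why they are evaluated in the thesis.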
94

Machine Learning Approaches for Modeling and Correction of Confounding Effects in Complex Biological Data

Wu, Chiung Ting 09 June 2021 (has links)
With the huge volume of biological data generated by new technologies and the rapid growth of machine-learning-based analytical tools, we expect to advance life science and human health at an unprecedented pace. Unfortunately, there is a significant gap between complex raw biological data from real life and the data required by mathematical and statistical tools. This gap stems from two fundamental and universal problems in biological data, both related to confounding effects. The first is the intrinsic complexity of the data. An observed sample could be a mixture of multiple underlying sources, and we may be interested in only one or some of those sources. The second type of complexity comes from the acquisition process of the data. Different samples may be gathered at different times and/or from different locations, so each sample is associated with a specific distortion that must be carefully addressed. These confounding effects obscure the signals of interest in the acquired data. Specifically, this dissertation addresses the two major challenges in confounding-effect removal: alignment and deconvolution. Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention time (RT) of the same compound across different samples, which must be corrected (aligned) during data processing. Classic alignment methods, such as those in the popular XCMS package, often assume a single time-warping function for each sample, neglecting the potentially varying RT drift of compounds with different masses within a sample. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments.
To utilize this information, we develop an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on the warping functions of neighboring samples. We applied ncGTW to two large-scale metabolomics LC-MS datasets, where it identified many misaligned features and successfully realigned them; these features would otherwise be discarded or left uncorrected by existing methods. When the desired signal is buried in a mixture, deconvolution is needed to recover the pure sources. Many biological questions can be better addressed when the data is in the form of individual sources instead of mixtures. Though there are some promising supervised deconvolution methods, unsupervised deconvolution is still needed when no a priori information is available. Among current unsupervised methods, Convex Analysis of Mixtures (CAM) is the most theoretically grounded and best-performing one. However, it has some major limitations. Most importantly, its overall time complexity can be very high, especially when analyzing a large dataset or a dataset with many sources. Also, because some steps are stochastic and heuristic, the deconvolution result is not accurate enough. To address these problems, we redesigned the modules of CAM. In the feature clustering step, we propose a clustering method, radius-fixed clustering, which not only bounds the spatial size of each cluster but also identifies outliers simultaneously. The disadvantages of K-means clustering, such as instability and the need to pre-specify the number of clusters, are thus avoided. Moreover, when identifying the convex hull, we replace Quickhull with linear programming, which decreases the computation time significantly.
To avoid the heuristic and approximate step in optimal simplex identification, we propose a greedy search strategy instead. The experimental results demonstrate a vast improvement in computation time, and the accuracy of the deconvolution is also shown to be higher than that of the original CAM. / Doctor of Philosophy / Due to the complexity of biological data, there are two major pre-processing steps: alignment and deconvolution. The alignment step corrects time- and location-related data acquisition distortion by aligning the detected signals to a reference signal. Though many alignment methods have been proposed for biological data, most fail to consider the relationships among samples carefully. This structural information can aid alignment when the data is noisy and/or irregular. To utilize this information, we develop a new method, Neighbor-wise Compound-specific Graphical Time Warping (ncGTW), inspired by graph theory. This new alignment method not only utilizes the structural information but also provides a reference-free solution. We show that the performance of our new method is better than that of other methods in both simulations and real datasets. When the signal comes from a mixture, deconvolution is needed to recover the pure sources. Many biological questions can be better addressed when the data is in the form of single sources instead of mixtures. A classic unsupervised deconvolution method is Convex Analysis of Mixtures (CAM). However, it has some limitations. For example, the time complexity of some steps is very high, so for a large dataset or a dataset with many sources the computation time would be extremely long. Also, because some steps are stochastic and heuristic, the deconvolution result may not be accurate enough. We improved CAM, and the experimental results show that both the speed and the accuracy of the deconvolution are significantly improved.
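For intuition about what "alignment by time warping" means, here is classic dynamic time warping (DTW) with a single warping per profile pair, i.e. the textbook baseline that ncGTW generalizes with compound-specific warping functions and neighbor constraints; the peak signals are synthetic.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic dynamic time warping cost between two 1-D profiles."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            D[i, j] = step + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two synthetic chromatographic peaks: the same compound eluting with a
# retention-time shift between runs. Warping absorbs the shift, so the
# DTW cost is far below a rigid point-by-point comparison.
t = np.linspace(0, 10, 50)
ref = np.exp(-(t - 4.0) ** 2)
shifted = np.exp(-(t - 5.0) ** 2)
print(dtw_cost(ref, shifted) < np.abs(ref - shifted).sum())
```

ncGTW's contribution is precisely that a single warping function like this one is insufficient when RT drift varies by compound mass and run order.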
95

Swarm Unmanned Aerial Vehicle Networks in Wireless Communications: Routing Protocol, Multicast, and Data Exchange

Song, Hao 24 March 2021 (has links)
Unmanned aerial vehicle (UAV) networks, as a flying platform, are a promising wireless communications infrastructure with wide-ranging applications in both commercial and military domains. Owing to appealing characteristics such as high mobility, high feasibility, and low cost, UAV networks can be applied in various scenarios, such as emergency communications, cellular networks, device-to-device (D2D) networks, and sensor networks, regardless of infrastructure and spatial constraints. To handle complicated missions, provide wireless coverage over a large range, and achieve a long lifetime, a UAV network may consist of a large number of UAVs working cooperatively as a swarm, also referred to as a swarm UAV network. Although high mobility and numerous UAVs offer high flexibility, high scalability, and performance enhancement for swarm UAV networks, they also incur technical challenges. One of the major challenges is routing protocol design. With high mobility, a dynamic network topology may be encountered. As a result, traditional routing protocols based on routing path discovery are not applicable in swarm UAV networks, as the discovered routing path may be outdated, especially when the number of UAVs is large, causing considerable routing path discovery delay. Multicast is an essential technology in scenarios where swarm UAV networks are employed as aerial small base stations (BSs), such as relays or micro BSs. Swarm UAV networks consisting of a large number of UAVs will encounter severe multicast delay with existing multicast methods that use acknowledgement (ACK) feedback and retransmissions. This issue is exacerbated when a swarm UAV network is deployed far away from BSs, causing high packet loss. Data exchange is another major technical challenge in swarm UAV networks, where UAVs exchange data packets with each other, for example when requesting and retrieving lost packets.
With numerous UAVs, data exchange between UAVs can cause a message and signaling storm, resulting in long data exchange delays and severe overhead. In this dissertation, I focus on developing novel routing protocols, multicast schemes, and data exchange schemes, enabling efficient, robust, and high-performance routing, multicast, and data exchange in swarm UAV networks. To be specific, two novel flooding-based routing protocols are designed in this dissertation, where random network coding (RNC) is utilized to improve the efficiency of flooding-based routing in swarm UAV networks without relying on network topology information or routing path discovery. Because a receiver can decode the original packets as long as it accumulates sufficiently many distinct encoded packets (generations), RNC naturally accelerates the routing process: it reduces the number of encoded packets that must be delivered in a given hop, since the receiver UAV may already have overheard some generations in previous hops and therefore needs fewer generations from the transmitter UAV in the current hop. To further expedite flooding-based routing, the second flooding-based routing protocol is designed, where each forwarding UAV creates a new generation by linearly combining received generations rather than by decoding the original packets. Although flooding-based routing is significantly hastened by RNC, its inherent drawback, namely the large number of hops, remains unsolved. Aiming at reducing the number of hops, a novel enhanced flooding-based routing protocol leveraging clustering is designed, where the whole UAV network is partitioned into multiple clusters and in each cluster only one UAV is selected as the representative of that cluster, participating in the flooding-based routing process.
In this way, the number of hops is bounded by the number of representatives, since packets are flooded only among a limited set of representatives rather than among numerous UAVs. To address the multicast issue in swarm UAV networks, a novel clustering-based multicast scheme is proposed, where a UAV experiencing packet loss retrieves the lost packets by requesting them from other UAVs in the same cluster, without depending on retransmissions from BSs. In this way, lost packet retrieval is carried out through short-distance data exchange between UAVs, with reliable transmissions and a short delay. Tractable stochastic geometry tools are used to model swarm UAV networks with a dynamic network topology, based on which a comprehensive analytical performance evaluation is given. To enable efficient data exchange between UAVs in swarm UAV networks, a data exchange scheme is proposed utilizing unsupervised learning. With the proposed scheme, all UAVs are assigned to multiple clusters and a UAV carries out data exchange only within its cluster. In this way, UAVs in different clusters perform data exchange in parallel, expediting the overall process. Agglomerative hierarchical clustering, a type of unsupervised learning, is used to conduct the clustering, guaranteeing that UAVs in the same cluster are able to supply and supplement each other's lost packets. Additionally, a data exchange mechanism, including a novel random backoff procedure, is designed, where the priorities of UAVs in data exchange are determined by the number of their lost packets or of the requested packets that they can provide. As a result, each request-reply process is fully exploited, maximally supplying lost packets not only to the UAV sending the request but also to other UAVs in the same cluster.
For all the technologies developed in this dissertation, the technical details and corresponding system procedures are designed on top of low-complexity, well-established technologies, such as carrier sense multiple access with collision avoidance (CSMA/CA), for practicality and without loss of generality. Moreover, extensive simulation studies are conducted to demonstrate the effectiveness and superiority of the proposed technologies, and system design insights are explored and revealed through simulations. / Doctor of Philosophy / Compared to fixed infrastructures in wireless communications, unmanned aerial vehicle (UAV) networks possess significant advantages, such as low cost, high mobility, and high feasibility, giving UAV networks a wide range of applications in both military and commercial fields. However, some characteristics of UAV networks, including dynamic network topology and numerous UAVs, may become technical barriers for wireless communications. One of the major challenges is routing protocol design. Routing is the process of selecting a routing path that enables data to be delivered from one node (the source) to another desired node (the destination). Traditionally, routing is performed based on routing path discovery, where control packets are broadcast and the path along which a control packet first reaches the destination is selected as the routing path. However, in UAV networks, routing path discovery may experience a long delay, as control packets traverse many UAVs. Besides, the discovered routing path may be outdated, as the topology of a UAV network changes over time. Another key technology in wireless communications that may not work well in UAV networks is multicast, where a transmitter, such as a base station (BS), broadcasts data to UAVs and all UAVs are required to receive this data.
With numerous UAVs, multicast delay may be severe, since the transmitter keeps retransmitting a data packet until all UAVs successfully receive it. This issue is exacerbated when a UAV network is deployed far away from BSs, causing high packet loss. Data exchange between UAVs is a fundamental and important system procedure in UAV networks. A large number of UAVs in a network causes serious data exchange delays, as many UAVs must compete for limited wireless resources to request or send data. In this dissertation, I focus on developing novel technologies and schemes for swarm UAV networks, in which a large number of UAVs cooperate to make the network powerful and handle complicated missions, enabling efficient, robust, and high-performance routing, multicast, and data exchange procedures. To be specific, two novel flooding-based routing protocols are designed, where random network coding (RNC) is utilized to improve the efficiency of flooding-based routing without relying on any network topology information or routing path discovery. The use of RNC naturally expedites the flooding-based routing process: with RNC, a receiver can decode the original packets as long as it accumulates sufficiently many encoded packets, which may be sent by different transmitters in different hops. As a result, in some hops fewer generations need to be transmitted, as receivers have already accumulated some encoded packets in previous hops. To further improve the efficiency of flooding-based routing, another routing protocol using RNC is designed, where UAVs create new encoded packets by linearly combining received encoded packets rather than by linearly combining original packets. This method is more efficient: UAVs do not need to collect enough encoded packets to decode the originals, but simply combine all the encoded packets they have received.
Although RNC can effectively improve the efficiency of flooding-based routing, its inherent drawback is still unsolved, namely the large number of hops caused by numerous UAVs. Thus, an enhanced flooding-based routing protocol using clustering is designed, where the whole UAV network is partitioned into multiple clusters. In each cluster, only one UAV is selected as the representative of that cluster, participating in the flooding-based routing process. In this way, the number of hops can be greatly reduced, as packets are only flooded among a limited set of representatives rather than among numerous UAVs. To address the multicast issue in swarm UAV networks, a novel multicast scheme is proposed, where a UAV experiencing packet loss retrieves its lost packets by requesting them from other UAVs in the same cluster, without depending on retransmissions from BSs. In this way, lost packet retrieval is carried out through short-distance data exchange between UAVs, with reliable transmissions and a short delay. The optimal number of clusters and the performance of the proposed multicast scheme are then investigated with tractable stochastic geometry tools. If all UAVs stay close together in a swarm UAV network, long data exchange delays become a significant technical issue, since UAVs cause considerable interference to each other and all UAVs compete for spectrum access. To cope with this, a data exchange scheme is proposed that leverages unsupervised learning. To avoid interference between UAVs and long waits for spectrum access, all UAVs are assigned to multiple clusters, and different clusters use different frequency bands to carry out data exchange simultaneously. Agglomerative hierarchical clustering, a type of unsupervised learning, is used to conduct the clustering, guaranteeing that UAVs in the same cluster are able to supply and supplement each other's lost packets.
Additionally, a data exchange mechanism is designed in which a UAV with more lost packets, or with more requested packets that it can provide, has a higher priority to carry out data exchange. In this way, each request-reply process is fully exploited, maximally supplying lost packets not only to the UAV sending the request but also to other UAVs in the same cluster. For all the technologies developed in this dissertation, the technical details and corresponding system procedures are designed on top of low-complexity, well-established technologies, such as carrier sense multiple access with collision avoidance (CSMA/CA), for practicality and without loss of generality. Moreover, extensive simulation studies are conducted to demonstrate the effectiveness and superiority of the developed technologies, and system design insights are explored and revealed through simulations.
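The RNC decoding property that the routing protocols exploit, namely that a receiver recovers the originals once it has accumulated enough linearly independent generations, can be sketched over GF(2); the packet sizes, the binary field, and the helper functions below are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]
        rank += 1
    return rank

def gf2_solve(A, B):
    """Solve A X = B over GF(2) for invertible A (Gauss-Jordan on [A | B])."""
    A, B = A.copy() % 2, B.copy() % 2
    n = A.shape[0]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r, col])
        A[[col, pivot]] = A[[pivot, col]]
        B[[col, pivot]] = B[[pivot, col]]
        for r in range(n):
            if r != col and A[r, col]:
                A[r] ^= A[col]
                B[r] ^= B[col]
    return B

rng = np.random.default_rng(1)
packets = rng.integers(0, 2, size=(3, 8), dtype=np.uint8)  # 3 originals, 8 bits each

# The receiver accumulates innovative generations: random GF(2)
# combinations of the originals, possibly overheard from different
# transmitters in different hops.
coeffs = np.zeros((0, 3), dtype=np.uint8)
payloads = np.zeros((0, 8), dtype=np.uint8)
while gf2_rank(coeffs) < 3:
    c = rng.integers(0, 2, size=(1, 3), dtype=np.uint8)
    if gf2_rank(np.vstack([coeffs, c])) > gf2_rank(coeffs):
        coeffs = np.vstack([coeffs, c])
        payloads = np.vstack([payloads, c @ packets % 2])

# Once the coefficient matrix is full rank, decoding is a linear solve.
decoded = gf2_solve(coeffs, payloads)
print(np.array_equal(decoded, packets))
```

This also shows why non-innovative (linearly dependent) generations are useless to a receiver, which is what makes overhearing in earlier hops reduce the traffic needed in later ones.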
96

Sequential Pattern Mining: A Proposed Approach for Intrusion Detection Systems

Lefoane, Moemedi, Ghafir, Ibrahim, Kabir, Sohag, Awan, Irfan U. 19 December 2023 (has links)
Technological advancements have played a pivotal role in the rapid proliferation of the fourth industrial revolution (4IR) through the deployment of Internet of Things (IoT) devices in large numbers. COVID-19 caused serious disruptions across many industries, with lockdowns and travel restrictions imposed across the globe. As a result, conducting business as usual became increasingly untenable, necessitating the adoption of new approaches in the workplace. For instance, virtual doctor consultations, remote learning, and virtual private network (VPN) connections for employees working from home became more prevalent. This paradigm shift has brought about positive benefits; however, it has also increased the attack vectors and attack surfaces, creating lucrative opportunities for cyberattacks. Consequently, more sophisticated attacks have emerged, including Distributed Denial of Service (DDoS) and ransomware attacks, which pose a serious threat to businesses and organisations worldwide. This paper proposes a system for detecting malicious activities in network traffic using sequential pattern mining (SPM) techniques. The proposed approach utilises SPM as an unsupervised learning technique to extract intrinsic communication patterns from network traffic, enabling the discovery of rules for detecting malicious activities and generating security alerts accordingly. By leveraging this approach, businesses and organisations can enhance the security of their networks, detect malicious activities including emerging ones, and respond proactively to potential threats.
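A minimal sketch of the SPM idea described above, assuming toy API-call sessions and only length-2 patterns: mine frequent subsequences from traffic and treat them as a model of normal behavior (the proposed system's actual mining algorithm and thresholds are not specified here).

```python
from collections import Counter

# Toy API-call sessions (one sequence of endpoint calls per client).
sessions = [
    ["login", "list", "download", "logout"],
    ["login", "list", "list", "download"],
    ["login", "download", "logout"],
    ["probe", "probe", "download"],
]

def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(any(call == item for call in it) for item in pattern)

# Support counting for all length-2 candidate patterns: frequent
# patterns model normal traffic, and sessions matching none of them
# can be flagged for review.
calls = sorted({c for s in sessions for c in s})
support = Counter()
for a in calls:
    for b in calls:
        support[(a, b)] = sum(is_subsequence((a, b), s) for s in sessions)

min_support = 3
frequent = {p for p, n in support.items() if n >= min_support}
print(sorted(frequent))
```

Real SPM algorithms such as PrefixSpan grow patterns of arbitrary length without enumerating all candidates, but the support-threshold logic is the same.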
97

A fuzzy logic solution for navigation of the Subsurface Explorer planetary exploration robot

Gauss, Veronica A. 22 August 2008 (has links)
An unsupervised fuzzy logic navigation algorithm is designed and implemented in simulation for the Subsurface Explorer (SSX) planetary exploration robot. The robot is intended for the subterranean exploration of Mars and will be equipped with acoustic sensing for detecting obstacles. Measurements of obstacle distance and direction are anticipated to be imprecise, however, since the performance of acoustic sensors is degraded in underground environments. Fuzzy logic is a satisfactory means of addressing imprecision in plant characteristics and has been implemented in a variety of autonomous vehicle navigation applications. However, most fuzzy logic algorithms that perform well in unknown environments have large rule bases or use complex methods for tuning fuzzy membership functions and rules. These qualities make them too computationally intensive for planetary exploration robots like the SSX. In this thesis, we introduce an unsupervised fuzzy logic algorithm that can determine a trajectory for the SSX through unknown environments. This algorithm uses a combination of simple fusion of robot behaviors and self-tuning membership functions to determine robot navigation without resorting to the degree of complexity of previous fuzzy logic algorithms. Finally, we present simulation results that demonstrate the practicality of our algorithm in navigating different environments. The simulations justify the use of our fuzzy logic technique and suggest future areas of research for fuzzy logic navigation algorithms. / Master of Science
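A minimal sketch of fuzzy behavior fusion, with hypothetical membership functions and rules (not the thesis's actual rule base): fuzzify obstacle distance, fire two rules, and defuzzify with a weighted average of the crisp actions.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b with support (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical fuzzy sets for obstacle distance (meters).
def near(d): return tri(d, 0.0, 1.0, 6.0)
def far(d):  return tri(d, 4.0, 10.0, 16.0)

# Two hypothetical rules, defuzzified by a weighted average of crisp
# turn rates (deg/s):
#   IF obstacle NEAR THEN turn hard (30)
#   IF obstacle FAR  THEN go straight (0)
def turn_rate(d):
    w_near, w_far = near(d), far(d)
    return (w_near * 30.0 + w_far * 0.0) / (w_near + w_far)

print(turn_rate(2.0), turn_rate(9.0))
```

A noisy distance reading moves the membership weights smoothly rather than flipping a hard if/else threshold, which is what makes this style of controller tolerant of the imprecise acoustic sensing the thesis anticipates.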
98

Mouse Social Behavior Classification Using Self-Supervised Learning Techniques

Sruthi Sundharram (18437772) 27 April 2024 (has links)
Traditional methods of behavior classification on videos of mice often rely on manually annotated datasets, which can be labor-intensive and resource-demanding to create. This research aims to address the challenges of behavior classification in mouse studies by leveraging an algorithmic framework employing self-supervised learning techniques capable of analyzing unlabeled datasets. This research seeks to develop a novel approach that eliminates the need for extensive manual annotation, making behavioral analysis more accessible and cost-effective for researchers, especially those in laboratories with limited access to annotated datasets.
99

Mathematical Modeling and Deconvolution for Molecular Characterization of Tissue Heterogeneity

Chen, Lulu 22 January 2020 (has links)
Tissue heterogeneity, arising from intermingled cellular or tissue subtypes, significantly obscures the analyses of molecular expression data derived from complex tissues. Existing computational methods performing data deconvolution from mixed subtype signals almost exclusively rely on supervising information, requiring subtype-specific markers, the number of subtypes, or the subtype compositions of individual samples. We develop a fully unsupervised deconvolution method to dissect complex tissues into molecularly distinctive tissue or cell subtypes directly from mixture expression profiles. We implement an R package, deconvolution by Convex Analysis of Mixtures (debCAM), that can automatically detect tissue- or cell-specific markers, determine the number of constituent subtypes, calculate subtype proportions in individual samples, and estimate tissue/cell-specific expression profiles. We demonstrate the performance and biomedical utility of debCAM on gene expression, methylation, and proteomics data. With enhanced data preprocessing and prior knowledge incorporation, the debCAM software tool will allow biologists to perform a deep and unbiased characterization of tissue remodeling in many biomedical contexts. Purified expression profiles from physical experiments provide both ground truth and a priori information that can be used to validate unsupervised deconvolution results or to improve supervision for various deconvolution methods. Detecting tissue- or cell-specific expressed markers from purified expression profiles plays a critical role in molecularly characterizing and determining tissue or cell subtypes. Unfortunately, classic differential analysis assumes a convenient test statistic and associated null distribution that is inconsistent with the definition of markers, and thus results in a high false positive rate or low detection power.
We describe a statistically principled marker detection method, the One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG) test, that estimates a mixture null distribution model by applying novel permutation schemes. Validated on realistic synthetic data sets with respect to both type 1 error and detection power, the OVESEG test, applied to benchmark gene expression data sets, detects many known and de novo subtype-specific expressed markers. Subsequent supervised deconvolution results, obtained using markers detected by the OVESEG test, showed superior performance when compared with popular peer methods. While the current debCAM approach can dissect mixed signals from multiple samples into the 'averaged' expression profiles of subtypes, many subsequent molecular analyses of complex tissues require sample-specific deconvolution, where each sample is a mixture of 'individualized' subtype expression profiles. The between-sample variation embedded in sample-specific subtype signals provides critical information for detecting subtype-specific molecular networks and uncovering hidden crosstalk. However, sample-specific deconvolution is an underdetermined and challenging problem because there are more variables than observations. We propose and develop debCAM2.0 to estimate sample-specific subtype signals by nuclear norm regularization, where the hyperparameter value is determined by a cross-validation scheme based on random entry exclusion. We also derive an efficient optimization approach based on the alternating direction method of multipliers (ADMM) to enable debCAM2.0 to be applied in large-scale biological data analyses. Experimental results on realistic simulation data sets show that debCAM2.0 can successfully recover subtype-specific correlation networks that are otherwise unobtainable using existing deconvolution methods. / Doctor of Philosophy / Tissue samples are essentially mixtures of tissue or cellular subtypes, where the proportions of individual subtypes vary across different tissue samples.
Data deconvolution aims to dissect tissue heterogeneity into biologically important subtypes, their proportions, and their marker genes. The physical solution to mitigating tissue heterogeneity is to isolate pure tissue components prior to molecular profiling. However, these experimental methods are time-consuming, expensive, and may alter the expression values during isolation. Existing literature primarily focuses on supervised deconvolution methods, which require a priori information. This approach has an inherent problem, as it relies on the quality and accuracy of the a priori information. In this dissertation, we propose and develop a fully unsupervised deconvolution method, deconvolution by Convex Analysis of Mixtures (debCAM), that can estimate the mixing proportions and 'averaged' expression profiles of the individual subtypes present in heterogeneous tissue samples. Furthermore, we also propose and develop debCAM2.0, which can estimate 'individualized' expression profiles of the participating subtypes in complex tissue samples. Subtype-specific expressed markers, or marker genes (MGs), serve as critical a priori information for supervised deconvolution. MGs are exclusively and consistently expressed in a particular tissue or cell subtype, and detecting such unique MGs when many subtypes are involved is a challenging task. We propose and develop a statistically principled method, the One Versus Everyone Subtype Exclusively-expressed Genes (OVESEG) test, for robust detection of MGs from purified profiles of many subtypes.
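As a generic illustration of unsupervised deconvolution of mixed expression profiles, the sketch below uses plain multiplicative-update NMF rather than debCAM's convex analysis: synthetic mixtures of three hidden subtype profiles are factored back into nonnegative proportions and profiles. The data shapes and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic mixtures: 40 samples x 60 genes, each sample a convex
# combination of 3 hidden subtype expression profiles plus small noise.
k = 3
S_true = rng.gamma(2.0, 1.0, size=(k, 60))      # subtype profiles
A_true = rng.dirichlet(np.ones(k), size=40)     # per-sample proportions
X = A_true @ S_true + rng.normal(0.0, 0.01, size=(40, 60)).clip(0)

# Multiplicative-update NMF: factor X ~= A S with nonnegative A and S.
A = rng.random((40, k)) + 0.1
S = rng.random((k, 60)) + 0.1
for _ in range(500):
    A *= (X @ S.T) / (A @ S @ S.T + 1e-9)
    S *= (A.T @ X) / (A.T @ A @ S + 1e-9)

rel_err = np.linalg.norm(X - A @ S) / np.linalg.norm(X)
print(round(rel_err, 3))
```

Plain NMF recovers *a* nonnegative factorization but not a unique one; debCAM's convex-geometry formulation is what pins down identifiable subtype proportions and profiles.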
100

Collaboration between UK universities : a machine-learning based webometric analysis

Kenekayoro, Patrick January 2014 (has links)
Collaboration is essential for some types of research, which is why some agencies include collaboration among the requirements for funding research projects. Studying collaborative relationships is important because analyses of collaboration networks can give insights into knowledge-based innovation systems, the roles that different organisations play in a research field, and the relationships between scientific disciplines. Co-authored publication data is widely used to investigate collaboration between organisations, but this data is not free and thus may not be accessible for some researchers. Hyperlinks have some similarities with citations, so hyperlink data may be used as an indicator to estimate the extent of collaboration between academic institutions and may be able to show types of relationships that are not present in co-authorship data. However, it has been shown that using raw hyperlink counts for webometric research can sometimes produce unreliable results, so researchers have attempted to find alternative counting methods and have tried to identify the reasons why hyperlinks are created in academic websites. This thesis uses machine learning techniques, an approach that has not previously been widely used in webometric research, to automatically classify hyperlinks and text in university websites in an attempt to filter out irrelevant hyperlinks when investigating collaboration between academic institutions. Supervised machine learning methods were used to automatically classify the web page types that can be found in Higher Education Institutions' websites. The results were assessed to see whether automatically filtered hyperlink data gave better results than raw hyperlink data in terms of identifying patterns of collaboration between UK universities.
Unsupervised learning methods were used to automatically identify groups of university departments that are collaborating, or that may benefit from collaborating together, based on their co-appearance in research clusters. Results show that the machine learning methods used in this thesis can automatically identify both the source and target web page categories of hyperlinks in university websites with up to 78% accuracy, which means that they can enable more effective hyperlink classification, or help identify the reasons why hyperlinks were created in university websites, if those reasons can be inferred from the relationship between the source and target page types. When machine learning techniques were used to filter out hyperlinks that may not have been created because of collaboration, there was an increased correlation between hyperlink data and other collaboration indicators. This emphasises the possibility of using machine learning methods to make hyperlink data a more reliable data source for webometric research. The reasons for university name mentions in the different web page types found in an academic institution's website are broadly the same as the reasons for link creation, which means that classification based on inter-page relationships may also be used to improve name-mentions data for webometrics research. Clustering research groups based on the text in their homepages may be useful for identifying research groups or departments with similar research interests, which may be valuable for policy makers in monitoring research fields (based on the sizes of identified clusters) and for identifying future collaborators (based on co-appearances in clusters), if shared research interests influence the choice of a future collaborator.
In conclusion, this thesis shows that machine learning techniques can be used to significantly improve the quality of hyperlink data for webometrics research, and can also be used to analyse other web based data to give additional insights that may be beneficial for webometrics studies.
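The text-based grouping of research-group homepages described above can be sketched with a small TF-IDF and cosine-similarity example; the documents, tokenization, and grouping criterion here are illustrative, not the thesis's actual pipeline.

```python
import numpy as np

# Toy "research group homepage" texts.
docs = [
    "machine learning neural networks deep learning",
    "deep learning computer vision networks",
    "medieval history archives manuscripts",
    "history manuscripts archival studies",
]

vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)
df = (tf > 0).sum(axis=0)
tfidf = tf * np.log(len(docs) / df)

# L2-normalize rows so that dot products are cosine similarities.
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
sim = tfidf @ tfidf.T

# Homepages sharing vocabulary score high; a clustering step over sim
# would then propose groups of departments with similar interests.
print(sim[0, 1] > sim[0, 2], sim[2, 3] > sim[2, 0])
```

Any standard clustering algorithm applied to these similarity scores would place the two machine-learning pages in one group and the two history pages in another, which is the kind of co-appearance signal the thesis proposes for identifying potential collaborators.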