Global ETD Search

261	POPULATION STRUCTURE INFERENCE USING PCA AND CLUSTERING ALGORITHMS Rimal, Suraj 01 September 2021 Genotype data, consisting of large numbers of markers, are used in demographic and association studies to identify genes related to specific traits or diseases. Processing these datasets for population structure inference usually takes a significant amount of time. Therefore, we propose applying PCA to the genotype data and then using clustering algorithms to assign individuals to their subpopulations. We collected both real and simulated datasets for this study. We applied PCA and selected significant features, then applied five different clustering techniques to obtain better results. Furthermore, we studied three different methods for predicting the optimal number of subpopulations in a dataset. The results on four simulated datasets and two real human genotype datasets show that our approach performs well in the inference of population structure. NbClust is more effective at inferring the subpopulations in a population. In this study, we showed that centroid-based clustering, such as k-means and PAM, performs better than model-based, spectral, and hierarchical clustering algorithms. This approach also has the benefit of being fast and flexible in the inference of population structure. Clustering Data Genotype PCA Population Structure
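The clustering step described in this abstract can be illustrated with a minimal sketch of centroid-based clustering (Lloyd's k-means). It assumes the PCA scores have already been computed; the `pc_scores` values, the function name `kmeans`, and the choice of the first k points as initial centroids are all hypothetical and are not taken from the thesis.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's k-means on a list of equal-length tuples.

    Initial centroids are simply the first k points, a deterministic
    choice made for this sketch only.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical scores on the first two principal components for six individuals.
pc_scores = [(-2.0, 0.1), (2.1, -0.2), (-1.9, -0.1), (1.8, 0.3), (-2.2, 0.0), (2.0, 0.1)]
labels, centroids = kmeans(pc_scores, k=2)
```

In practice the number of clusters `k` would come from a selection method such as those the abstract compares, rather than being fixed by hand.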
262	Internetové souřadnicové systémy / Internet coordinating systems Krajčír, Martin January 2009 A network coordinates (NC) system is an efficient mechanism for predicting Internet distance from a limited number of measurements. This work focuses on a distributed coordinate system, which is evaluated by relative error. Based on experimental results from a simulated application, a new algorithm for computing network coordinates was created. The algorithm was tested on a simulated network as well as on RTT values from the PlanetLab network. Experiments show that clustered nodes achieve good synthetic coordinates even with limited connections between nodes. This work proposes implementing the NC system in a network with hierarchical aggregation. The resulting application was placed on the research projects web page of the Department of Telecommunications.
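As a rough illustration of how a coordinate system of this kind can be scored, the sketch below predicts the distance between two nodes as the Euclidean distance between their coordinates and compares it with a measured RTT. The abstract does not give the exact error formula, so the definition used here (absolute difference divided by the measured value), the node names, and the numbers are assumptions.

```python
import math


def predicted_distance(coord_a, coord_b):
    """Predicted network distance: Euclidean distance between coordinates."""
    return math.dist(coord_a, coord_b)


def relative_error(predicted, measured):
    """Relative error of a prediction against a measured RTT (assumed definition)."""
    return abs(predicted - measured) / measured


# Hypothetical 2-D coordinates (in milliseconds) and one measured RTT.
coords = {"A": (0.0, 0.0), "B": (30.0, 40.0)}
measured_rtt_ms = 40.0

pred = predicted_distance(coords["A"], coords["B"])
err = relative_error(pred, measured_rtt_ms)
```

Other definitions divide by the smaller of the predicted and measured values; which one the thesis uses is not stated in the abstract.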
263	Spectral methods for the detection and characterization of Topologically Associated Domains Cresswell, Kellen Garrison 01 January 2019 The three-dimensional (3D) structure of the genome plays a crucial role in gene expression regulation. Chromatin conformation capture technologies (Hi-C) have revealed that the genome is organized in a hierarchy of topologically associated domains (TADs), sub-TADs, and chromatin loops which is relatively stable across cell-lines and even across species. These TADs dynamically reorganize during development of disease, and exhibit cell- and condition-specific differences. Identifying such hierarchical structures and how they change between conditions is a critical step in understanding genome regulation and disease development. Despite their importance, there are relatively few tools for identification of TADs and even fewer for identification of hierarchies. Additionally, there are no publicly available tools for comparison of TADs across datasets. These tools are necessary to conduct large-scale genome-wide analysis and comparison of 3D structure. To address the challenge of TAD identification, we developed a novel sliding window-based spectral clustering framework that uses gaps between consecutive eigenvectors for TAD boundary identification. Our method, implemented in an R package, SpectralTAD, has automatic parameter selection, is robust to sequencing depth, resolution and sparsity of Hi-C data, and detects hierarchical, biologically relevant TADs. SpectralTAD outperforms four state-of-the-art TAD callers in simulated and experimental settings. We demonstrate that TAD boundaries shared among multiple levels of the TAD hierarchy were more enriched in classical boundary marks and more conserved across cell lines and tissues. SpectralTAD is available at http://bioconductor.org/packages/SpectralTAD/. To address the problem of TAD comparison, we developed TADCompare.
TADCompare is based on a spectral clustering-derived measure called the eigenvector gap, which enables a loci-by-loci comparison of TAD boundary differences between datasets. Using this measure, we introduce methods for identifying differential and consensus TAD boundaries and tracking TAD boundary changes over time. We further propose a novel framework for the systematic classification of TAD boundary changes. Colocalization- and gene enrichment analysis of different types of TAD boundary changes revealed distinct biological functionality associated with them. TADCompare is available on https://github.com/dozmorovlab/TADCompare. Genomics Spectral Genetics Biostatistics Statistical Clustering Biostatistics
264	Sensor Fusion Algorithm for Airborne Autonomous Vehicle Collision Avoidance Applications Doe, Julien Albert 01 December 2018 A critical ability of any aircraft is to be able to detect potential collisions with other airborne objects, and maneuver to avoid these collisions. This can be done by utilizing sensors on the aircraft to monitor the sky for collision threats. However, several problems face a system which aims to use multiple sensors for target tracking. The data collected from sensors needs to be clustered, fused, and otherwise processed such that the flight control system can make accurate decisions based on it. Raw sensor data, while filled with useful information, is tainted with inaccuracies due to limitations and imperfections of the sensor. Combined use of different sensors presents further issues in how to handle disagreements between sensor data. This thesis project tackles the problem of processing data from multiple sensors (in this application, a radar and an infrared sensor) on an airborne platform in order to allow the aircraft to make flight corrections to avoid collisions. radar sensor fusion clustering Kalman filter
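One common way to reconcile disagreeing sensor readings, and the one suggested by the "Kalman filter" keyword, is the scalar Kalman measurement update, which weights each reading by its uncertainty. The sketch below is a generic textbook update, not the thesis's algorithm; the tracked quantity (a bearing in degrees), the variances, and the readings are hypothetical.

```python
def kalman_update(estimate, variance, measurement, measurement_variance):
    """One scalar Kalman measurement update.

    Returns the new estimate and its variance after blending in one
    measurement; noisier measurements receive a smaller gain.
    """
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


# Hypothetical prior bearing to a target (degrees) and its variance.
bearing, var = 10.0, 4.0

# Fuse a radar reading, then an infrared reading (values are made up).
bearing, var = kalman_update(bearing, var, measurement=9.0, measurement_variance=1.0)
bearing, var = kalman_update(bearing, var, measurement=10.5, measurement_variance=0.25)
```

Because the infrared reading is assigned the smaller variance in this example, the fused bearing ends up closer to it than to the radar reading.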
265	PERFORMANCE STUDY OF SOW-AND-GROW: A NEW CLUSTERING ALGORITHM FOR BIG DATA Maier, Joshua 01 May 2020 DBSCAN is a density-based clustering algorithm known for being able to find irregularly shaped clusters and to handle noise points. For very large datasets, however, this algorithm becomes inefficient because it must visit every point and examine its neighborhood in order to determine the clusters. DBSCAN is also hard to implement in parallel due to the structure of the data and its sequential data access. The Sow-and-Grow algorithm is a parallel, density-based clustering algorithm. It uses the concept of growing points to find clusters more efficiently than by going through every point in the dataset in sequential order. We create an initial seed set of variable size based on user input and a dynamic growing points vector to cluster the data. Our algorithm is designed for shared memory and can be run in parallel using threads. For our experiments, multiple datasets were used with varying numbers of points and dimensions. We used these datasets to show the significant speedup the Sow-and-Grow algorithm produces compared to other parallel, density-based clustering algorithms. On some datasets, Sow-and-Grow achieves an 8x speedup over another density-based algorithm. We also looked at how changing the number of seeds affects the results in terms of runtime and clusters discovered. clustering density-based parallel Sow-and-Grow
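The cost the abstract attributes to DBSCAN is easiest to see in a minimal sequential implementation, sketched below. This is classic DBSCAN with a brute-force neighborhood search, not the Sow-and-Grow algorithm; the function names, the sample points, and the `eps` and `min_pts` values are chosen only for illustration.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (brute force, O(n))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Classic sequential DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):  # every point is visited in order
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: keep growing
                queue.extend(j_neighbours)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Each call to `region_query` scans the whole dataset, and the outer loop depends on labels written by earlier iterations, which is the sequential access pattern that makes a straightforward parallel version difficult.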
266	MACHINE LEARNING BASED IDS LOG ANALYSIS Tianshuai Guan (10710258) 06 May 2021 With the rapid development of information technology, network traffic is also increasing dramatically. However, many cyber-attack records are buried in this large amount of network traffic. Therefore, many Intrusion Detection Systems (IDS) that can extract those malicious activities have been developed. Zeek is one of them, and due to its powerful functions and open-source environment, Zeek has been adopted by many organizations. Information Technology at Purdue (ITaP), which uses Zeek as its IDS, captures netflow logs for all network activity across the whole campus but has not delved into effective use of the information. This thesis examines ways to help increase the performance of anomaly detection. To that end, this project combines basic database concepts with several different machine learning algorithms and compares the results from different combinations to better find potential attack activities in log files. Pattern Recognition and Data Mining Clustering analysis IDS
267	Spectral Density Function Estimation with Applications in Clustering and Classification Chen, Tianbo 03 March 2019 The spectral density function (SDF) plays a critical role in spatio-temporal data analysis, where the data are analyzed in the frequency domain. Although many methods have been proposed for SDF estimation, real-world applications in many research fields, such as neuroscience and environmental science, call for better methodologies. In this thesis, we focus on the spectral density functions for time series and spatial data, develop new estimation algorithms, and use the estimators as features for clustering and classification purposes. The first topic is motivated by clustering electroencephalogram (EEG) data in the spectral domain. To identify synchronized brain regions that share similar oscillations and waveforms, we develop two robust clustering methods based on the functional data ranking of the estimated SDFs. The two proposed clustering methods use different dissimilarity measures and their performance is examined by simulation studies in which two types of contaminations are included to show the robustness. We apply the methods to two sets of resting-state EEG data collected from a male college student. Then, we propose an efficient collective estimation algorithm for a group of SDFs. We use two sets of basis functions to represent the SDFs for dimension reduction, and then, the scores (the coefficients of the basis) estimated by maximizing the penalized Whittle likelihood are used for clustering the SDFs in a much lower dimension. For spatial data, an additional penalty is applied to the likelihood to encourage the spatial homogeneity of the clusters. The proposed methods are applied to cluster the EEG data and the soil moisture data. Finally, we propose a parametric estimation method for the quantile spectrum. We approximate the quantile spectrum by the ordinary spectral density of an AR process at each quantile level.
The AR coefficients are estimated by solving Yule-Walker equations using the Levinson algorithm. Numerical results from simulation studies show that the proposed method outperforms other conventional smoothing techniques. We build a convolutional neural network (CNN) to classify the estimated quantile spectra of the earthquake data in Oklahoma and achieve a 99.25% accuracy on testing sets, which is 1.25% higher than using ordinary periodograms. spectral analysis functional data analysis clustering classification
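The step of solving Yule-Walker equations with the Levinson algorithm can be sketched as the standard Levinson-Durbin recursion shown below. It is a generic version for an ordinary autocovariance sequence and does not reproduce the thesis's quantile-level procedure; the function name and the example autocovariances (those of an AR(1) process with coefficient 0.5, scaled to unit variance) are assumptions.

```python
def levinson_durbin(autocov, order):
    """Solve the Yule-Walker equations for an AR(order) model.

    autocov[k] is the autocovariance at lag k, for k = 0..order.
    Returns (phi, error), where phi[j] multiplies x[t - (j + 1)] and
    error is the innovation variance of the fitted model.
    """
    phi = []
    error = autocov[0]
    for m in range(1, order + 1):
        # Part of the lag-m autocovariance not explained by the AR(m-1) fit.
        acc = autocov[m] - sum(phi[j] * autocov[m - 1 - j] for j in range(m - 1))
        reflection = acc / error
        # Update the earlier coefficients and append the new one.
        phi = [phi[j] - reflection * phi[m - 2 - j] for j in range(m - 1)] + [reflection]
        error *= 1.0 - reflection ** 2
    return phi, error


# Autocovariances of an AR(1) process with coefficient 0.5 (unit variance).
autocov = [1.0, 0.5, 0.25]
phi, error = levinson_durbin(autocov, order=2)
```

Fitting an AR(2) model to this AR(1) sequence returns a second coefficient of zero, which is a quick sanity check on the recursion.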
268	Identifiering av områden med förhöjd olycksrisk för cyklister baserad på cykelhjälmsdata / Identification of areas with elevated accident risk for cyclists based on bicycle helmet data Roos, Johannes, Lindqvist, Sven January 2020 The number of cyclists in Sweden is expected to increase in the coming years, but despite major efforts in road safety, the number of serious bicycle accidents is not decreasing at the same rate as car accidents. This study looked at data that the bicycle helmet manufacturer Hövding collected from its customers. The helmet acts as an airbag that is triggered by the strong head movement that occurs in an accident. The data consist of GPS positions along with a value generated by a Support Vector Machine (SVM), which indicates how close the helmet is to registering an accident and thus being triggered.
The purpose of the study was to analyze this data from cyclists in Malmö to see whether it is possible to identify places that are over-represented in the number of elevated SVM levels, and whether these sites reflect real, potentially dangerous traffic situations. Density-based spatial clustering of applications with noise (DBSCAN) was used to identify clusters of elevated SVM levels. DBSCAN is an unsupervised clustering algorithm widely used for clustering spatial data that contains noise. From these clusters, the number of unique cycle trips that generated an elevated SVM level in the cluster was calculated, as well as the total number of cycle trips that passed through each cluster. 405 clusters were identified and sorted by the number of unique cycle trips that generated an elevated SVM level, whereupon the top 30 were selected for further analysis. In order to validate the clusters against registered bicycle accidents, data were obtained from the Swedish Traffic Accident Data Acquisition (STRADA), the national accident database in Sweden. The thirty selected clusters had 0.082% cycling accidents per unique cycle trip in the clusters, and for the remaining 375 clusters the figure was 0.041%. The number of accidents per cluster in the selected thirty clusters was 0.46 and the number for the other clusters was 0.064. The top thirty clusters were then sorted into three categories. The clusters that had a possible explanation for elevated SVM levels, such as speed bumps and cobblestones, were given category 1. Hövding has stated that such features of the road surface can generate mildly elevated SVM levels. Category 2 was the clusters that had a construction site within the cluster. Category 3 was the clusters that could not be explained by either of the other two categories.
The proportion of accidents per unique cycle trip in clusters belonging to category 1 was 0.068%, for category 2 0.071% and for category 3 0.106%. The results indicate that this data is useful for identifying places with increased risk of accidents for cyclists. The data processed in this study has a number of weaknesses, and the results should be interpreted with caution. For example, the data covers a short period of time, about 6 months, so seasonal cycling behavior is not represented in the data set. The data set is also assumed to contain some noisy data, which may have affected the results. But there is potential in this type of data: in the future, when more data has been collected, it could be used to identify places with higher risk of accidents for cyclists with greater accuracy. DBSCAN clustering Engineering and Technology
269	Efficient Detection of Overlapping Communities in Large Graphs Millson, Richard 19 January 2022 This thesis proposes an algorithm for the efficient detection of overlapping communities in large graphs. Only very fast local algorithms like Louvain are practical for very large datasets, but they tend to give hierarchical rather than overlapping partitions. We develop techniques that yield reasonable families of overlapping partitions while preserving most of the good properties of Louvain. We build on an advance in the efficient detection of separated communities, the multilevel Louvain method, and draw inspiration from the Wang-Landau efficiency improvement to Markov chain Monte Carlo sampling. Partitions are iteratively proposed by Louvain, with the internal edges of the best parts downweighted after each step. This suppresses the dominant parts in subsequent partitions, allowing alternative parts to appear. The result is an ensemble of parts describing the overlapping structure of the network. graph clustering community detection overlapping communities
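The reweighting step in this abstract, in which the internal edges of a selected part are downweighted before the next partition is proposed, can be sketched in isolation as below. The Louvain step itself is omitted, and the edge representation, the function name, and the factor of 0.5 are assumptions; the abstract does not say how strongly edges are downweighted.

```python
def downweight_internal_edges(weights, part, factor=0.5):
    """Return new edge weights with edges inside `part` scaled by `factor`.

    weights maps an edge (u, v) to its weight; an edge is internal when
    both endpoints belong to `part`. Other edges are left unchanged.
    """
    return {
        (u, v): w * factor if u in part and v in part else w
        for (u, v), w in weights.items()
    }


# Hypothetical weighted graph: a path a-b-c-d with unit weights.
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
best_part = {"a", "b", "c"}  # a part assumed to come from a Louvain run

reweighted = downweight_internal_edges(weights, best_part)
```

Running a community detection step again on `reweighted` would make the part `{a, b, c}` less attractive, which is how alternative, overlapping parts get a chance to appear.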
270	ENHANCING FUZZY CLUSTERING METHODS FOR IMAGE SEGMENTATION USING SPATIAL INFORMATION CHEN, SHANGYE 30 April 2019 No description available. Computer Science Image segmentation FCM Fuzzy clustering
