Global ETD Search

371	Unsupervised Categorical Clustering on Labor Markets Steffen, Matthew James 10 April 2023 (has links) During this "white collar recession," there is a flooded labor market of workers. For employers seeking to hire, there is a need to identify potential qualified candidates for each job. The current state of the art is LinkedIn Recruiting or elastic search on resumes. The current state of the art lacks efficiency and scalability along with an intuitive ranking of candidates. We believe this can be fixed with multi-layer categorical clustering via modularity maximization. To test this, we gathered a dataset that is extensive and representative of the job market. Our data comes from PeopleDataLabs and LinkedIn and is sampled from 153 million individuals. As such, this data represents one of the most informative datasets for the task of ranking and clustering job titles and skills. Properly grouping individuals will help identify more candidates to fulfill the multitude of vacant positions. We implement a novel framework for categorical clustering, involving these attributes to deliver a reliable pool of candidates. We develop a metric for clustering based on commonality to rank clustering algorithms. The metric prefers modularity-based clustering algorithms like the Louvain algorithm. This allows us to use such algorithms to outperform other unsupervised methods for categorical clustering. Our implementation accurately clusters emergency services, health-care and other fields, while managerial positions are interestingly swamped by soft or uninformative features, thereby resulting in dominant ambiguous clusters. Categorical Clustering Modularity Maximization Unsupervised Clustering Louvain Sparse Categorical Data Physical Sciences and Mathematics
372	Multi-scale clustering in graphs using modularity / Multiskal-klustring i grafer med moduläritet Charpentier, Bertrand January 2019 (has links) This thesis provides a new hierarchical clustering algorithm for graphs, named Paris, which can be interpreted through the modularity score and its resolution parameter. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs. It tries to approximate the optimal partitions with respect to the modularity score at any resolution in one run. In addition to the Paris hierarchical algorithm, this thesis proposes four algorithms that compute rankings of the sharpest clusters, clusterings and resolutions by processing the hierarchy output by Paris. These algorithms are based on a new measure of stability for clusterings, named sharp-score. Key outcomes of these four algorithms are the ability to rank clusters, detect the sharpest clustering scales, go beyond the resolution limit, and detect relevant resolutions. All these algorithms have been tested on both synthetic and real datasets to illustrate the efficiency of their approaches. Hierarchical clustering Multi-scale clustering Graph Modularity Resolution Dendrogram Computer Sciences
373	Domain Expertise–Agnostic Feature Selection for the Analysis of Breast Cancer Data Pozzoli, Susanna January 2019 (has links) At present, high-dimensional data sets are becoming more and more frequent. The problem of feature selection has already become widespread, owing to the curse of dimensionality. Unfortunately, feature selection is largely based on ground truth and domain expertise. It is possible that ground truth and/or domain expertise will be unavailable; therefore, there is a growing need for unsupervised feature selection in multiple fields, such as marketing and proteomics. Now, unlike in the past, it is possible for biologists to measure the amount of protein in a cancer cell. No wonder the data is high-dimensional: the human body is composed of thousands and thousands of proteins. Intuitively, only a handful of proteins cause the onset of the disease. It might be desirable to cluster the cancer sufferers, but at the same time we want to find the proteins that produce good partitions. We hereby propose a methodology designed to find the features able to maximize the clustering performance. After we divided the proteins into different groups, we clustered the patients. Next, we evaluated the clustering performance. We developed a couple of pipelines. Whilst the first focuses its attention on the data provided by the laboratory, the second takes advantage both of the external data on protein complexes and of the internal data. We set the threshold of clustering performance thanks to the biologists at Karolinska Institutet who contributed to the project. In the thesis we show how to make a good selection of features without domain expertise in the case of breast cancer data. This experiment illustrates how we can reach a clustering performance up to eight times better than the baseline with the aid of feature selection. breast cancer clustering clustering performance evaluation feature selection proteomics unsupervised learning Computer and Information Sciences
374	Dynamic Classification Using the Adaptive Competitive Algorithm Deldadehasl, Maryam 01 December 2023 (has links) (PDF) The Vector Quantization (VQ) model proposes a powerful solution for data clustering. Its design indicates a specific combination of concepts from machine learning and dynamical systems theory to classify input data into distinct groups. The model evolves over time to better match the distribution of the input data. This adaptive feature is a strength of the model, as it allows the cluster centers to shift according to the input patterns, effectively quantizing the data distribution. It is a gradient dynamical system, using the energy function V as its Lyapunov function, and thus possesses properties of convergence and stability. These characteristics make the VQ model a promising tool for complex data analysis tasks, including those encountered in machine learning, data mining, and pattern recognition. In this study, we have applied the dynamic model to the "Breast Cancer Wisconsin Diagnostic" dataset, a comprehensive collection of features derived from digitized images of fine needle aspirate (FNA) of breast masses. This dataset, comprising various diagnostic measurements related to breast cancer, poses a unique challenge for clustering due to its high dimensionality and the critical nature of its application in medical diagnostics. By employing the model, we aim to demonstrate its efficacy in handling complex, multidimensional data, especially in the realm of medical pattern recognition and data mining. This integration not only highlights the model's versatility in different domains but also showcases its potential in contributing significantly to medical diagnostics, particularly in breast cancer identification and classification. Neural networks real-time clustering Vector Quantization
375	A Power Iteration Based Co-Training Approach to Achieve Convergence for Multi-View Clustering Yallamelli, Pavankalyan January 2017 (has links) No description available. Computer Science Social Research Co-training Multi-View clustering Power iteration clustering Wisdom of Crowd Fantasy Sports
376	Pattern Recognition in Large Dimensional and Structured Datasets Kurra, Goutham 11 March 2002 (has links) No description available. Computer Science feature selection partial profile clustering pattern recognition clustering structured data gene expression data analysis
377	Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data Freudenberg, Johannes M. January 2009 (has links) No description available. Bioinformatics Bayesian infinite mixture models clustering differential co-expression functional clustering analysis gene expression context specificity
378	Clustering of Multi-Domain Information Networks Alqadah, Faris 09 July 2010 (has links) No description available. Computer Science Data Mining Clustering Information Networks Subspace Clusters Bi-Clustering Formal Concept Analysis
379	Non-parametric Clustering and Topic Modeling via Small Variance Asymptotics with Local Search Singh, Siddharth January 2013 (has links) No description available. Computer Science Clustering Topic Modeling Non-parametric clustering Local search Merge Parameter selection
380	Density Based Clustering using Mutual K-Nearest Neighbors Dixit, Siddharth January 2015 (has links) No description available. Computer Science Density based clustering K-nearest neighbor Mutual K-nearest neighbor Clustering

Search results