1 |
Clustering Multiple Contextually Related Heterogeneous Datasets
Hossain, Mahmood 09 December 2006
Traditional clustering is typically based on a single feature set. In some domains, several feature sets may be available to represent the same objects, but it may not be easy to compute a useful and effective integrated feature set. We hypothesize that clustering the individual datasets and then combining the results with a suitable ensemble algorithm will yield better-quality clusters than either the individual clusterings or clustering based on an integrated feature set. We present two classes of algorithms for combining the results of clustering multiple related datasets, where the datasets represent identical or overlapping sets of objects but use different feature sets. One class of algorithms combines hierarchical clusterings generated from multiple datasets, and the other combines partitional clusterings generated from multiple datasets. The first class, called EPaCH, is based on graph-theoretic principles and uses the association strengths of objects in the individual cluster hierarchies. The second class, called CEMENT, uses an EM (Expectation Maximization) approach to progressively refine the individual clusterings until the mutual entropy between them converges toward a maximum. We applied our methods to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. After several natural language preprocessing steps, both syntactic and semantic feature sets were extracted. We present empirical results that compare our algorithms with several baseline clustering schemes using different cluster validation indices. We also present the results of one-tailed paired t-tests performed on cluster qualities.
Our methods are shown to yield higher-quality clusters than the baseline clustering schemes, which include clustering based on individual feature sets and clustering based on concatenated feature sets. When the sets of objects represented in two datasets are overlapping but not identical, our algorithms outperform all baseline methods on all indices.
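The general idea of combining clusterings obtained from different feature sets can be illustrated with a co-association (evidence accumulation) consensus. This is a generic ensemble technique chosen for illustration, not the EPaCH or CEMENT algorithms from the abstract; the function names, the 0.5 threshold, and the three label vectors are all assumptions made for this sketch.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of input clusterings in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    strengths = {}
    for i, j in combinations(range(n), 2):
        together = sum(1 for labels in labelings if labels[i] == labels[j])
        strengths[(i, j)] = together / len(labelings)
    return strengths


def consensus_clusters(labelings, threshold=0.5):
    """Link objects co-clustered in more than `threshold` of the inputs and
    return the connected components as consensus clusters."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), strength in co_association(labelings).items():
        if strength > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for obj in range(n):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())


# Hypothetical cluster labels for six objects from three feature sets.
syntactic = [0, 0, 0, 1, 1, 1]
semantic = [1, 1, 1, 0, 2, 2]
concatenated = [0, 0, 0, 0, 1, 1]
print(consensus_clusters([syntactic, semantic, concatenated]))
# prints [[0, 1, 2], [3], [4, 5]]
```

Object 3 ends up alone because no pair involving it is co-clustered in a majority of the three inputs. Cluster labels need not match across inputs, since only pairwise co-membership is counted.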
|
2 |
Evolving Ensemble-Clustering to a Feedback-Driven Process
Lehner, Wolfgang, Habich, Dirk, Hahmann, Martin 01 November 2022
Data clustering is a widely used knowledge extraction technique that is applied in more and more application domains. In recent years, many algorithms have been proposed that are often complicated and/or tailored to specific scenarios. As a result, clustering has become hard to access for non-expert users, who face major difficulties such as algorithm selection and parameterization. To overcome this issue, we develop a novel feedback-driven clustering process based on a new perspective on clustering. By substituting user-friendly feedback for parameterization and by supporting result interpretation, clustering becomes accessible and allows the step-by-step construction of a satisfying result through iterative refinement.
|
3 |
Vehicle Usage Modelling Under Different Contexts
Kalia, Nidhi Rani, Bagepalli Ashwathanarayana, Sachin Bharadwaj January 2021
Modern vehicles are equipped with highly sensitive sensors that continuously log information while the vehicle is in motion. These vehicles also face performance issues such as increased fuel consumption, breakdowns, or failures. The information logged by the sensors can be used to analyze and evaluate these issues. Because vehicles are sold and used in many different places, they can perform differently depending on how they are operated and driven, and the usage of a vehicle varies over time. Moreover, the European Accident Research and Safety Report from Volvo describes the factors responsible for road fatalities and accidents. It states that 90% of road fatalities are caused by the way the vehicle is driven and 30% by external weather and environmental factors. Therefore, in this work, vehicle usage is modeled over time to determine the different usage styles of a vehicle and how they can affect its performance. The proposed framework is divided into four separate modules: data pre-processing, data segmentation, unsupervised machine learning, and pattern analysis. Ensemble clustering methods are the main tool used to extract patterns of vehicle usage style and vehicle performance in different seasons from truck logged vehicle data (LVD). The results indicate a strong correlation between vehicle usage style and vehicle performance, which requires further investigation.
|
4 |
Algoritmo rápido para segmentação de vídeos utilizando agrupamento de clusters [Fast algorithm for video segmentation using cluster ensembles]
Monma, Yumi January 2014
This work presents a very fast algorithm for segmenting the moving parts of a video, based on detecting surfaces of the scene with closed contours. The input video is preprocessed with an edge detection algorithm based on level lines to produce the objects. The detected objects are clustered using a combination of mean shift clustering and ensemble clustering. To further reduce the computation time, two methods can be combined: filtering objects by size and selecting only a few frames of the video. Because the detected objects are coherent in time (they keep the same label across multiple frames), frame skipping has little effect on the final result. Depending on the application, the detected clusters can be refined in post-processing steps.
|
7 |
Non-negative matrix factorization for integrative clustering / Алгоритми интегративног кластеровања података применом ненегативне факторизације матрице / Algoritmi integrativnog klasterovanja podataka primenom nenegativne faktorizacije matrice
Brdar Sanja 15 December 2016
Integrative approaches are motivated by the desire to improve robustness, stability, and accuracy. Clustering, the prevailing technique for preliminary and exploratory analysis of experimental data, may benefit from integration across multiple partitions. In this thesis we propose integration methods based on non-negative matrix factorization that can fuse clusterings stemming from different data sets, different data preprocessing steps, or different sub-samples of objects or features. The proposed methods are evaluated from several points of view on typical machine learning data sets, on synthetic data, and above all on data from the bioinformatics realm, whose rise is fuelled by technological revolutions in molecular biology. The vast amounts of 'omics' data now available call for sophisticated computational methods. We evaluated the methods on problems from cancer genomics, functional genomics, and metagenomics. / The subject of this doctoral dissertation is clustering (data grouping) algorithms and the possibilities for improving them through an integrative approach, with the aim of increasing reliability and robustness to noise and outliers in the data and of enabling data fusion. The dissertation proposes methods based on non-negative matrix factorization. The methods were successfully implemented and analyzed in detail on a variety of data sets from the UCI repository and on synthetic data that are typically used to evaluate new algorithms and compare them with existing methods. Most of the dissertation is devoted to applications in bioinformatics, a domain rich in heterogeneous data and challenging tasks. The evaluation was carried out on data from functional genomics, cancer genomics, and metagenomics.
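As a rough illustration of the building block named in the abstract, the sketch below factorizes a non-negative matrix V into non-negative factors W and H using the standard multiplicative update rules for the squared (Frobenius) error. It is not the integration method of the thesis: the helper names, the rank, the iteration count, and the small co-association matrix used as input are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: V (n x m) is approximated by W (n x rank) H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical co-association matrix for four objects forming two groups.
v = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
w, h = nmf(v, rank=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

One common way to read a clustering off such a factorization is to assign each object to the factor with the largest entry in its row of W. Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout.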
|
8 |
Unsupervised learning of relation detection patterns
Gonzàlez Pellicer, Edgar 01 June 2012
Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments.
Information extraction requires a significant amount of linguistic knowledge. The specificity of this knowledge limits the portability of systems, because a change of language, domain, or style demands costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as large document collections become increasingly available, completely unsupervised approaches become necessary to mine the knowledge they contain.
This thesis proposes incorporating clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this goal required, first, considering the different strategies by which this combination could be carried out; second, developing or adapting clustering algorithms suited to our needs; and third, devising pattern learning procedures that incorporate clustering information.
By the end of this thesis, we had developed and implemented an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with, and even outperforms, other comparable state-of-the-art approaches.
|
9 |
Community Detection in Imperfect Networks
Dahlin, Johan January 2011
Community detection in networks is an important area of current research with many applications. Finding community structures is a challenging task, and despite significant effort no satisfactory method has been found. Different methods find different communities in the same network and have different computational requirements. To counter this problem, several methods are often used and their results compared manually. In this thesis, we present three methods that instead merge the results from different methods (or from several runs of the same algorithm) to find better estimates of the community structure. Another problem in practical applications is noisy and imperfect networks with missing and false edges. These imperfections are natural consequences of the methods used to map the network structure and are often difficult to eliminate. In this thesis, we apply a Monte Carlo sampling method in combination with the introduced methods for merging community detection results to find community structures in such networks. The method is tested in simulation studies on both real-world and synthetic networks with generated uncertainties and imperfections. Finally, we demonstrate how the merging methods can be used to generate confidence levels for the obtained community structure. This allows for a qualitative comparison of the robustness and significance of the network clustering.
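The combination of Monte Carlo sampling with merged community assignments can be sketched as follows, under strong simplifying assumptions: each uncertain edge is given a hypothetical existence probability, networks are sampled from those probabilities, and connected components stand in for a real community detection algorithm. None of the names or values below come from the thesis, and its three merging methods are not reproduced here.

```python
import random
from itertools import combinations


def connected_components(nodes, edges):
    """Placeholder community detector: each connected component is one community."""
    neighbours = {node: set() for node in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    label, seen = {}, set()
    for start in nodes:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            label[node] = start  # label every node by its component's first node
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return label


def co_membership_confidence(nodes, edge_probs, samples=500, seed=0):
    """Estimate how often each pair of nodes lands in the same community
    across networks sampled from the edge existence probabilities."""
    rng = random.Random(seed)
    together = {pair: 0 for pair in combinations(nodes, 2)}
    for _ in range(samples):
        sampled = [edge for edge, p in edge_probs.items() if rng.random() < p]
        label = connected_components(nodes, sampled)
        for a, b in together:
            if label[a] == label[b]:
                together[(a, b)] += 1
    return {pair: count / samples for pair, count in together.items()}


# Hypothetical network: one certain edge, one uncertain edge, one false edge.
nodes = ["a", "b", "c", "d"]
edge_probs = {("a", "b"): 1.0, ("b", "c"): 0.5, ("c", "d"): 0.0}
confidence = co_membership_confidence(nodes, edge_probs)
for pair, value in sorted(confidence.items()):
    print(pair, value)
```

Pairs whose co-membership frequency is close to 0 or 1 are stable under the assumed edge uncertainty, while intermediate values flag assignments that depend on which uncertain edges happen to be present.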
|
10 |
Feedback-Driven Data Clustering
Hahmann, Martin 28 February 2014
The acquisition of data and its analysis has become a common yet critical task in many areas of modern economy and research. Unfortunately, the ever-increasing scale of datasets has long outgrown the capacities and abilities humans can muster to extract information from them and gain new knowledge. For this reason, research areas like data mining and knowledge discovery steadily gain importance. The algorithms they provide for the extraction of knowledge are mandatory prerequisites that enable people to analyze large amounts of information. Among the approaches offered by these areas, clustering is one of the most fundamental. By finding groups of similar objects inside the data, it aims to identify meaningful structures that constitute new knowledge. Clustering results are also often used as input for other analysis techniques like classification or forecasting.
As clustering extracts new and unknown knowledge, it obviously has no access to any form of ground truth. For this reason, clustering results have a hypothetical character and must be interpreted with respect to the application domain. This makes clustering very challenging and leads to an extensive and diverse landscape of available algorithms. Most of these are expert tools that are tailored to a single narrowly defined application scenario. Over the years, this specialization has become a major trend that arose to counter the inherent uncertainty of clustering by including as many domain specifics as possible in the algorithms. While customized methods often improve result quality, they become more and more complicated to handle and lose versatility. This creates a dilemma, especially for amateur users, whose numbers are increasing as clustering is applied in more and more domains. While an abundance of tools is offered, guidance is severely lacking, and users are left alone with critical tasks like algorithm selection, parameter configuration, and the interpretation and adjustment of results.
This thesis aims to solve this dilemma by structuring and integrating the necessary steps of clustering into a guided and feedback-driven process. In doing so, users are provided with a default modus operandi for the application of clustering. Two main components constitute the core of said process: the algorithm management and the visual-interactive interface. Algorithm management handles all aspects of actual clustering creation and the involved methods. It employs a modular approach for algorithm description that allows users to understand, design, and compare clustering techniques with the help of building blocks. In addition, algorithm management offers facilities for the integration of multiple clusterings of the same dataset into an improved solution. New approaches based on ensemble clustering not only allow the utilization of different clustering techniques, but also ease their application by acting as an abstraction layer that unifies individual parameters. Finally, this component provides a multi-level interface that structures all available control options and provides the docking points for user interaction.
The visual-interactive interface supports users during result interpretation and adjustment. For this, the defining characteristics of a clustering are communicated via a hybrid visualization. In contrast to traditional data-driven visualizations that tend to become overloaded and unusable with increasing volume/dimensionality of data, this novel approach communicates the abstract aspects of cluster composition and relations between clusters. This aspect orientation allows the use of easy-to-understand visual components and makes the visualization immune to scale related effects of the underlying data. This visual communication is attuned to a compact and universally valid set of high-level feedback that allows the modification of clustering results. Instead of technical parameters that indirectly cause changes in the whole clustering by influencing its creation process, users can employ simple commands like merge or split to directly adjust clusters.
The orchestrated cooperation of these two main components creates a modus operandi in which clusterings are no longer created and discarded as a whole until a satisfying result is obtained. Instead, users apply the feedback-driven process to iteratively refine an initial solution. The performance and usability of the proposed approach were evaluated in a user study. Its results show that the feedback-driven process enabled amateur users to easily create satisfying clustering results, even from different and suboptimal starting situations.
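The high-level feedback commands mentioned above (merge and split) can be pictured as direct operations on a label vector. The sketch below is only an illustration under assumed conventions: clusters are integer labels over one-dimensional values, and a split cuts a cluster at its largest gap. The thesis does not specify its operations at this level, so both function signatures and the splitting rule are hypothetical.

```python
def merge(labels, keep, absorb):
    """Return new labels with cluster `absorb` folded into cluster `keep`."""
    return [keep if lab == absorb else lab for lab in labels]


def split(values, labels, target, new_label):
    """Split cluster `target` at the largest gap between its sorted values;
    members above the gap receive `new_label`."""
    members = sorted((values[i], i) for i, lab in enumerate(labels) if lab == target)
    if len(members) < 2:
        return list(labels)  # nothing to split
    gaps = [members[k + 1][0] - members[k][0] for k in range(len(members) - 1)]
    cut = gaps.index(max(gaps)) + 1
    result = list(labels)
    for _, index in members[cut:]:
        result[index] = new_label
    return result


# Hypothetical initial clustering of five one-dimensional points.
values = [1.0, 1.2, 5.0, 5.1, 9.0]
labels = [0, 0, 0, 1, 1]

after_split = split(values, labels, target=0, new_label=2)
print(after_split)  # prints [0, 0, 2, 1, 1]

after_merge = merge(after_split, keep=2, absorb=1)
print(after_merge)  # prints [0, 0, 2, 2, 2]
```

Each command changes only the clusters it names, which is the contrast the abstract draws with technical parameters that alter the whole clustering indirectly.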
|