<p>Cluster analysis has a long history, and although it is applied in many fields, significant challenges remain. The dissertation presents an introduction to the nonsmooth optimization approach to clustering, with reference to the problem of clustering large datasets. However, these optimization algorithms work better with continuous data. One of the main challenges in cluster analysis is working with large datasets with categorical and mixed (numerical and categorical) variable types. Working with a large number of instances (objects) and a large number of dimensions (variables) can be a problem in cluster analysis because of time complexity. One way to solve this problem is to reduce the number of instances without loss of information.<br />
The first aim of the dissertation was to compare clustering results on the whole dataset and on simple random samples with categorical and mixed data, for different sample sizes and different numbers of clusters. No significant difference (p>0.05) was found between the clustering results on samples of size 0.03m, 0.05m, 0.1m and 0.3m (where m is the size of the dataset) and on the whole dataset.<br />
The second aim of the dissertation was to construct an efficient procedure for clustering large datasets with categorical and mixed variable types. The proposed procedure consists of the following steps: 1. clustering on simple random samples of a given cardinality; 2. determining the best cluster solution on the sample by applying an appropriate validity criterion; 3. using the cluster centers obtained from that sample to cluster the rest of the dataset.<br />
The third aim of the dissertation is the application of cluster analysis to define clusters of behavioural risk factors in the adult population of Serbia, together with an analysis of the sociodemographic characteristics of the resulting clusters. Cluster analysis was applied to a large representative sample of the adult population of Serbia aged 20 and over. Five clearly separated clusters with characteristic combinations of behavioural risk factors were identified: No risk factors, Harmful alcohol use and other risky habits, Unhealthy diet and other risky habits, Insufficient physical activity, Smoking. The results of the multinomial logistic regression model indicate that respondents who are unmarried, of poorer material status, less educated and living in Vojvodina have a higher chance of having multiple behavioural risk factors.</p> / <p>Cluster analysis has a long history and a large number of clustering techniques have been developed in many areas; however, significant challenges still remain. In this thesis we provide an introduction to the nonsmooth optimization approach to clustering, with reference to clustering large datasets. Nevertheless, these optimization clustering algorithms work much better when a dataset contains only vectors with continuous features. One of the main challenges is clustering large datasets with categorical and mixed (numerical and categorical) data. Dealing with a large number of instances (objects) and a large number of dimensions (variables) can be problematic because of time complexity. One way to solve this problem is to reduce the number of instances without loss of information.<br />
The first aim of this thesis was to compare the results of clustering algorithms on the whole dataset and on simple random samples with categorical and mixed data, in terms of validity, for different numbers of clusters and for different sample sizes. There were no significant differences (p>0.05) between the results obtained on samples of size 0.03m, 0.05m, 0.1m and 0.3m (where m is the size of the dataset) and on the whole dataset.<br />
The second aim of this thesis was to develop an efficient clustering procedure for large datasets with categorical and mixed (numerical and categorical) values. The proposed procedure consists of the following steps: 1. clustering on simple random samples of a given cardinality; 2. finding the best cluster solution on a sample (by an appropriate validity measure); 3. using the cluster centers from this sample to cluster the remaining data.<br />
The third aim of this thesis was to examine the clustering of four lifestyle risk factors and the variation across different socio-demographic groups in the Serbian adult population. Cluster analysis was carried out on a large representative sample of Serbian adults aged 20 and over. We identified five homogeneous health behaviour clusters with specific combinations of risk factors: 'No Risk Behaviours', 'Drinkers with Risk Behaviours', 'Unhealthy Diet with Risk Behaviours', 'Insufficient Physical Activity' and 'Smoking'. Results of multinomial logistic regression indicated that adults who are single, less educated, of low socio-economic status and living in the region of Vojvodina are most likely to belong to the clusters with a high-risk profile.</p>
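<p>The three-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract names neither the clustering algorithm nor the validity criterion, so the sketch assumes a basic k-modes algorithm with the simple matching (Hamming) dissimilarity and the mean silhouette width as the validity measure, and it handles categorical attributes only. All function names and parameters (<code>cluster_large_dataset</code>, <code>fraction</code>, <code>k_values</code>) are hypothetical.</p>

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(row, centers):
    """Index of the center closest to row (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: hamming(row, centers[j]))

def k_modes(rows, k, rng, max_iter=20):
    """Basic k-modes: each center is the attribute-wise mode of its cluster."""
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(r, centers) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members))
    return centers, labels

def silhouette(rows, labels):
    """Mean silhouette width under the Hamming dissimilarity (O(n^2))."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        return -1.0  # a single cluster cannot be scored; rank it last
    total = 0.0
    for i, l in enumerate(labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(hamming(rows[i], rows[j]) for j in own) / (len(own) - 1)
        b = min(sum(hamming(rows[i], rows[j]) for j in idx) / len(idx)
                for m, idx in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(rows)

def cluster_large_dataset(rows, fraction=0.1, k_values=(2, 3, 4), seed=0):
    """Sample, pick the best solution on the sample, then label every row."""
    rng = random.Random(seed)
    n = min(len(rows), max(max(k_values), round(fraction * len(rows))))
    sample = rng.sample(rows, n)  # step 1: simple random sample
    # Step 2: cluster the sample for each candidate k, keep the best score.
    best = max((k_modes(sample, k, rng) for k in k_values),
               key=lambda result: silhouette(sample, result[1]))
    centers = best[0]
    # Step 3: assign every object to its nearest sample-derived center.
    return centers, [nearest(r, centers) for r in rows]

# Toy usage on synthetic categorical data (invented for illustration).
data = [("a", "x", "p")] * 50 + [("b", "y", "q")] * 50
centers, labels = cluster_large_dataset(data, fraction=0.3, k_values=(2, 3))
print(len(centers), len(labels))
```

<p>Only the sample is clustered iteratively, so the expensive part of the computation depends on the sample size rather than on m; the rest of the dataset is labelled in a single pass. Mixed data would need a dissimilarity that also covers numerical attributes, which this sketch does not attempt.</p>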
Identifier | oai:union.ndltd.org:uns.ac.rs/oai:CRISUNS:(BISIS)99629 |
Date | 23 June 2016 |
Creators | Dragnić Nataša |
Contributors | Lužanin Zorana, Ač-Nikolić Eržebet, Tepavčević Andreja, Krejić Nataša, Kvrgić Svetlana, Grujić Vera |
Publisher | Univerzitet u Novom Sadu, Doktorske disertacije iz interdisciplinarne odnosno multidisciplinarne oblasti na Univerzitetu u Novom Sadu, University of Novi Sad, Doctoral dissertations in the interdisciplinary or multidisciplinary field |
Source Sets | University of Novi Sad |
Language | Serbian |
Detected Language | Unknown |
Type | PhD thesis |