Global ETD Search

111	Clustering in Swedish : The Impact of some Properties of the Swedish Language on Document Clustering and an Evaluation Method Rosell, Magnus January 2005 (has links) <p>Text clustering divides a set of texts into groups, so that texts within each group are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on known ones. The contributions of this thesis are an investigation of text representation for Swedish and an evaluation method that uses two or more manual categorizations.</p><p>Text clustering, at least as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Since Swedish has more morphological variation than, for instance, English, we have used a stemmer to strip suffixes. This gives moderate improvements and reduces the number of words in the representation.</p><p>Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. In the ordinary vector space model the constituents of compounds are not accounted for when calculating the similarity between texts. To use them we have employed a spell checking program to split compounds. The results clearly show that this is beneficial.</p><p>The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. None of our experiments have shown any improvement over using the ordinary model.</p><p>Evaluation of text clustering results is very hard. 
What is a good partition of a text set is inherently subjective. Automatic evaluation methods are either internal or external. Internal quality measures use the representation in some manner. Therefore they are not suitable for comparisons of different representations.</p><p>External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is -- text sets differ in difficulty to cluster and categorizations are more or less adapted to a particular text set. We describe an evaluation method for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. We also describe the kappa coefficient as a clustering quality measure in the same setting.</p> / <p>Text clustering divides a set of texts into groups so that the texts within each group resemble one another in content. Text clustering can be used to uncover structure and content in unknown text sets and to gain new perspectives on already known ones. The contributions of this thesis are an investigation of text representations for Swedish texts and an evaluation method that makes use of two or more manual categorizations.</p><p>Text clustering, at least as it is described here, uses the vector space model, which is widely used in the field. In this model, texts are represented by the words that occur in them, and texts that have many words in common are regarded as similar in content. What counts as a word differs between languages. We have investigated the effect of some of the properties of Swedish on text clustering. Since Swedish has greater morphological variation than, for example, English, we have removed suffixes with the help of a stemmer. This gives slightly better results and reduces the number of words in the representation.</p><p>In Swedish, solid compounds are used and created all the time. Most parts of compounds are used as words on their own and in many different compounds. Solid compounds in Swedish often correspond to phrases and open compounds in other languages. The parts of compounds are not used in the similarity calculation in the vector space model. To make use of them we have used a spell-checking program to split compounds. The results clearly show that this is beneficial.</p><p>The vector space model takes no account of the order of the words. We have tried to extend the model with noun phrases in different ways. None of our experiments shows any improvement compared with the ordinary simple model.</p><p>It is very difficult to evaluate text clustering results. It is in the nature of things that what constitutes a good partition of a set of texts is subjective. Automatic evaluation methods are either internal or external. Internal quality measures make use of the representation in some way. They are therefore not suitable for comparisons of different representations.</p><p>External quality measures compare a clustering with a (manual) categorization of the same set of texts. The theoretical best values of the measures are known, but it is not obvious what a good value is -- sets of texts differ in how difficult they are to cluster, and categorizations are more or less suited to a particular set of texts. We describe an evaluation method that can be used when a set of texts has more than one categorization. In such cases the result for a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. We also describe the kappa coefficient as a quality measure for clustering under the same conditions.</p> Document Clustering Language technology Språkteknologi
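As an illustration of the vector space model described in the abstract of entry 111, the following minimal sketch represents texts by word counts and compares them with cosine similarity. The function names and the hand-made compound split are assumptions for illustration only; the thesis itself used a stemmer and a spell checking program, neither of which is reproduced here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no words with anything
    return dot / (norm_a * norm_b)


# Hypothetical example: the solid compound "textklustring" shares no word
# with "text klustring" until it is split (here the split is done by hand).
unsplit = cosine_similarity(bag_of_words("textklustring"),
                            bag_of_words("text klustring"))
split = cosine_similarity(bag_of_words("text klustring"),
                          bag_of_words("text klustring"))
```

In this toy case the unsplit compound has zero similarity to its own constituents, while the split form matches them fully, which is the effect the abstract attributes to compound splitting.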
112	Μοντελοποίηση γραφημάτων σε μπλοκ / Block modeling of graphs Μπέκας, Σταύρος 25 May 2009 (has links) The aim of this thesis is to present the techniques and methods that must be followed to partition the vertices of a simple graph into groups of vertices that have identical or similar patterns of connection to other groups. The thesis consists of four parts. The first part defines the simple graph, the clustering, the block, and the blockmodel, concepts that are needed for what follows. The second part describes two types of equivalence used to cluster the vertices of a graph, as well as two methods for obtaining the clustering of the vertices. The third part gives a more general idea for clustering the vertices of a graph, based on one of the two preceding approaches to clustering. The fourth and final part gives an extension of the clustering to two-mode (bipartite) data and presents the results of applying this extension to an empirical example. / - Μπλοκ Ομαδοποίηση 511.5 Block Clustering
113	Kalbos signalų klasterizacija / Speech signal clustering Čupajeva, Inga 11 June 2004 (has links) This work is devoted to the problem of cluster analysis of speech signals. The main methods of cluster analysis are reviewed, and a clustering algorithm based on vector quantization is developed. Speaker identification experiments were performed to investigate how identification accuracy and computational complexity depend on the number of clusters. Informatics Signal clustering Signalų klasterizacija
114	Bayesian cluster validation Koepke, Hoyt Adam 11 1900 (has links) We propose a novel framework based on Bayesian principles for validating clusterings and present efficient algorithms for use with centroid or exemplar based clustering solutions. Our framework treats the data as fixed and introduces perturbations into the clustering procedure. In our algorithms, we scale the distances between points by a random variable whose distribution is tuned against a baseline null dataset. The random variable is integrated out, yielding a soft assignment matrix that gives the behavior under perturbation of the points relative to each of the clusters. From this soft assignment matrix, we are able to visualize inter-cluster behavior, rank clusters, and give a scalar index of the clustering stability. In a large test on synthetic data, our method matches or outperforms other leading methods at predicting the correct number of clusters. We also present a theoretical analysis of our approach, which suggests that it is useful for high dimensional data. Clustering Cluster validation Unsupervised learning
115	Analysis of Industrial Construction activities using Knowledge Discovery Techniques Gonzalez, Carlos V. Unknown Date No description available. Industrial Clustering Pipe Modules Indicator
116	Intelligent Clustering in Wireless Sensor Networks Guderian, Robert 19 September 2012 (has links) Wireless Sensor Networks (WSNs) are networks of small devices, called motes, designed to monitor resources and report to a server. Motes are battery-powered and have very little memory to store data. To conserve power, the motes usually form clusters to coordinate their activities. In heterogeneous WSNs, the motes have different resources available to them. For example, some motes might have more powerful radios or larger power supplies. Exploiting heterogeneity within a WSN can allow the network to stay active for longer periods of time. In WSNs, the communications between motes draw the most power. By choosing better clusterheads in the clusters to control and route messages, all motes in the network will have longer lifespans. By leveraging heterogeneity to select better clusterheads, I have developed the Heterogeneous Clustering Control Protocol (HCCP). HCCP is designed to be highly robust to change and to fully utilize the resources that are currently available. WSN clustering Heterogeneous cooperative systems
118	Chip-level and reconfigurable hardware for data mining applications Perera, Darshika Gimhani 04 May 2012 (has links) Since the mid-2000s, the realm of portable and embedded computing has expanded to include a wide variety of applications. Data mining is one of the many applications that are becoming common on these devices. Many of today’s data mining applications are compute and/or data intensive, requiring more processing power than ever before; thus, speed performance is a major issue. In addition, embedded devices have stringent area and power requirements. At the same time, manufacturing cost and time-to-market are decreasing rapidly. To satisfy the constraints associated with these devices, and also to improve the speed performance, it is imperative to incorporate some special-purpose hardware into embedded system design. In some cases, reconfigurable hardware support is desirable to provide the flexibility required in the ever-changing application environment. Our main objective is to provide chip-level and reconfigurable hardware support for data mining applications in portable, handheld, and embedded devices. We focus on the most widely used data mining tasks, clustering and classification. Our investigation of the hardware design and implementation of similarity computation (an important step in clustering/classification) illustrates that chip-level hardware support for data mining operations is indeed a feasible and worthwhile endeavour. Further performance gain is achieved with hardware optimizations such as parallel processing. To address the issue of limited hardware footprint on portable and embedded devices, we investigate reconfigurable computing systems. We introduce dynamic reconfigurable hardware solutions for similarity computation using a multiplexer-based approach, and for principal component analysis (another important step in clustering/classification) using a partial reconfiguration method. 
Experimental results are encouraging and show great potential for implementing data mining applications on reconfigurable platforms. Finally, we formulate a design methodology for FPGA-based dynamic reconfigurable hardware, in order to select the most efficient FPGA-based reconfiguration method(s) for specific applications on portable and embedded devices. This design methodology can be generalized to other embedded applications and gives guidelines to the designer based on the computation model and characteristics of the application. / Graduate portable embedded hardware clustering classification
119	Filtering and clustering GPS time series for lifespace analysis Morrison, Laura May 04 April 2013 (has links) This thesis focuses on various aspects of community mobility and lifespace. Mobility is of particular interest to those working with the elderly population or patients affected by neurological diseases, such as Alzheimer's and Parkinson's diseases. One aspect of mobility is the number of “hotspots” in a person's daily (or weekly) trajectory, which represent the locations at which an individual remains for a minimum predetermined length of time. The individual demonstrates potentially limited mobility if there is only one identified hotspot; the individual is more mobile if there are multiple identified hotspots. Based on GPS time series, we can use cluster analysis to identify hotspots. However, existing clustering algorithms such as k-means and trimmed k-means do not take into account the time dependencies between the location points in the series, and require knowing the number of clusters ahead of time. Thus, the resulting clusters do not represent the subjects' activity centres well. In this thesis we have developed a robust time-dependent clustering criterion that works very well to find clusters. Another aspect of mobility is the total distance travelled. The total distance computed from the original GPS data is inflated as there is noise in the data. Due to the particular characteristics of noise specific to GPS time series, we have investigated the identification of noisy segments of data as well as smoothing techniques. The average amplitude of acceleration is proposed as an appropriate method to identify the large noise that occurs in GPS data. A multi-level trimmed means smoother is proposed as an appropriate method to filter the identified large noise. Three methods were investigated to determine an ellipse that identifies the spatial area an individual purposely moves through in daily life. 
The classical and robust 95% ellipses contain 95% of the points, but do not necessarily capture the distinct shape of the data. The minimum spanning ellipse over the series with all points in each identified cluster reduced to each cluster's central value captures the shape of the data very well and is proposed as the most appropriate lifespace ellipse. Results are obtained and presented for the subjects available in the mobility study for the total distance travelled and a meaningful lower bound, the number of hotspots, the proportion of time spent in the hotspots, as well as the area of the classical 95% ellipse, robust 95% ellipse and minimum spanning ellipse. In the processing of the data, other problems that had to be addressed include obtaining appropriate estimates for the missing values and translating time series from degrees of longitude and latitude to metres in the Cartesian (x,y) plane. / Graduate / 0463 / lauramor@uvic.ca Clustering Filtering Lifespace Time series
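The abstract of entry 119 mentions translating GPS time series from degrees of longitude and latitude to metres in the Cartesian (x,y) plane and computing total distance travelled. The abstract does not say which projection was used; the sketch below assumes a simple equirectangular approximation around a reference point, which is reasonable only over small areas. The names `to_local_xy`, `path_length`, and the Earth radius constant are illustrative assumptions, not taken from the thesis.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in metres


def to_local_xy(lat_deg, lon_deg, ref_lat_deg, ref_lon_deg):
    """Approximate (x, y) offsets in metres from a reference point.

    Uses the equirectangular approximation: longitude differences are
    scaled by the cosine of the reference latitude.
    """
    x = (math.radians(lon_deg - ref_lon_deg)
         * EARTH_RADIUS_M * math.cos(math.radians(ref_lat_deg)))
    y = math.radians(lat_deg - ref_lat_deg) * EARTH_RADIUS_M
    return x, y


def path_length(points):
    """Sum of straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))
```

Because every noisy fix adds a small detour, `path_length` on raw data overestimates the distance travelled, which is why the abstract pairs the distance measure with noise identification and smoothing.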
120	Approach to Evaluating Clustering Using Classification Labelled Data Luu, Tuong January 2010 (has links) Cluster analysis has been identified as a core task in data mining for which many different algorithms have been proposed. The diversity, on one hand, provides us with a wide collection of tools. On the other hand, the profusion of options easily causes confusion. Given a particular task, users do not know which algorithm is good since it is not clear how clustering algorithms should be evaluated. As a consequence, users often select a clustering algorithm in a very ad hoc manner. A major challenge in evaluating clustering algorithms is the scarcity of real data with a "correct" ground truth clustering. This is in stark contrast to the situation for classification tasks, where there are abundantly many data sets labeled with their correct classifications. As a result, clustering research often relies on labeled data to evaluate and compare the results of clustering algorithms. We present a new perspective on how to use labeled data for evaluating clustering algorithms, and develop an approach for comparing clustering algorithms on the basis of classification labeled data. We then use this approach to support a novel technique for choosing among clustering algorithms when no labels are available. We use these tools to demonstrate that the utility of an algorithm depends on the specific clustering task. Investigating a set of common clustering algorithms, we demonstrate that there are cases where each one of them outputs better clusterings. In contrast to the current trend of looking for a superior clustering algorithm, our findings demonstrate the need for a variety of different clustering algorithms. clustering empirical study Computer Science
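The abstract of entry 120 does not name the measure it uses to compare a clustering with class labels. As a generic illustration of the idea, the sketch below computes purity, a common external measure, which is an assumption here and not necessarily the thesis's approach. The function name `purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Fraction of items that carry the majority label of their cluster."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("at least one item is required")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    # Count, per cluster, the items that share its most frequent label.
    majority = sum(counts.most_common(1)[0][1]
                   for counts in by_cluster.values())
    return majority / len(labels)


# Toy data: the second clustering puts one "b" item in the wrong cluster.
perfect = purity([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = purity([0, 0, 0, 1], ["a", "a", "b", "b"])
```

Purity rewards many small clusters (one item per cluster always scores 1.0), which is one reason a single measure rarely settles which algorithm is better for a given task.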
