Global ETD Search

1	An application of topic modeling algorithms to text analytics in business intelligence Alsadhan, Majed January 1900 (has links) Master of Science / Department of Computing and Information Sciences / Doina Caragea / William H. Hsu / In this work, we focus on the task of clustering businesses in the state of Kansas based on the content of their websites and their business listing information. Our goal is to cluster the businesses and overcome the challenges facing current approaches such as: data noise, low number of clustered businesses, and lack of evaluation approach. We propose an LSA-based approach to analyze the businesses’ data and cluster those businesses by using Bisecting K-Means algorithm. In this approach, we analyze the businesses’ data by using LSA and produce businesses’ representations in a reduced space. We then use the businesses’ representations to cluster the businesses by applying the Bisecting K-Means algorithm. We also apply an existing LDA-based approach to cluster the businesses and compare the results with our proposed LSA-based approach at the end. In this work, we evaluate the results by using a human-expert-based evaluation procedure. At the end, we visualize the clusters produced in this work by using Google Earth and Tableau. According to our evaluation procedure, the LDA-based approach performed slightly bet- ter then the LSA-based approach. However, with the LDA-based approach, there were some limitations which are: low number of clustered businesses, and not being able to produce a hierarchical tree for the clusters. With the LSA-based approach, we were able to cluster all the businesses and produce a hierarchical tree for the clusters. LSA LDA Clustering Businesses Bisecting Kmeans Computer Science (0984) Economics (0501)
2	ALGORITMOS DE CLUSTERING PARALELOS EN SISTEMAS DE RECUPERACIÓN DE INFORMACIÓN DISTRIBUIDOS Jiménez González, Daniel 20 July 2011 (has links) La información es útil si cuando se necesita está disponible y se puede hacer uso de ella. La disponibilidad suele darse fácilmente cuando la información está bien estructurada y ordenada, y además, no es muy extensa. Pero esta situación no es la más común, cada vez se tiende más a que la cantidad de información ofrecida crezca de forma desmesurada, que esté desestructurada y que no presente un orden claro. La estructuración u ordenación manual es inviable debido a las dimensiones de la información a manejar. Por todo ello se hace clara la utilidad, e incluso la necesidad, de buenos sistemas de recuperación de información (SRI). Además, otra característica también importante es que la información tiende a presentarse de forma natural de manera distribuida, lo cual implica la necesidad de SRI que puedan trabajar en entornos distribuidos y con técnicas de paralelización. Esta tesis aborda todos estos aspectos desarrollando y mejorando métodos que permitan obtener SRI con mejores prestaciones, tanto en calidad de recuperación como en eficiencia computacional, los cuales además permiten trabajar desde el enfoque de sistemas ya distribuidos. El principal objetivo de los SRI será proporcionar documentos relevantes y omitir los considerados irrelevantes respecto a una consulta dada. Algunos de los problemas más destacables de los SRI son: la polisemia y la sinonimia; las palabras relacionadas (palabras que juntas tienen un signi cado y separadas otro); la enormidad de la información a manejar; la heterogeneidad de los documentos; etc. De todos ellos esta tesis se centra en la polisemia y la sinonimia, las palabras relacionadas (indirectamente mediante la lematización semántica) y en la enormidad de la información a manejar. El desarrollo de un SRI comprende básicamente cuatro fases distintas: el preprocesamiento, la modelización, la evaluación y la utilización. El preprocesamiento que conlleva las acciones necesarias para transformar los documentos de la colección en una estructura de datos con la información relevante de los documentos ha sido una parte importante del estudio de esta tesis. En esta fase nos hemos centrado en la reducción de los datos y estructuras a manejar, maximizando la información contenida. La modelización, ha sido la fase más analizada y trabajada en esta tesis, es la que se encarga de defi nir la estructura y comportamiento del SRI. / Jiménez González, D. (2011). ALGORITMOS DE CLUSTERING PARALELOS EN SISTEMAS DE RECUPERACIÓN DE INFORMACIÓN DISTRIBUIDOS [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/11234 / Palancia Paralelo Distribuido Recuperación de información Clustering Dbscan Paralelismo Bisecting K-means Vdbscan Heuristica
3	High Performance Text Document Clustering Li, Yanjun 13 June 2007 (has links) No description available. Document Clustering Text Mining K-means Bisecting K-means Algorithm Performance Analysis.
4	Clustering the Web : Comparing Clustering Methods in Swedish / Webbklustring : En jämförelse av klustringsmetoder på svenska Hinz, Joel January 2013 (has links) Clustering -- automatically sorting -- web search results has been the focus of much attention but is by no means a solved problem, and there is little previous work in Swedish. This thesis studies the performance of three clustering algorithms -- k-means, agglomerative hierarchical clustering, and bisecting k-means -- on a total of 32 corpora, as well as whether clustering web search previews, called snippets, instead of full texts can achieve reasonably decent results. Four internal evaluation metrics are used to assess the data. Results indicate that k-means performs worse than the other two algorithms, and that snippets may be good enough to use in an actual product, although there is ample opportunity for further research on both issues; however, results are inconclusive regarding bisecting k-means vis-à-vis agglomerative hierarchical clustering. Stop word and stemmer usage results are not significant, and appear to not affect the clustering by any considerable magnitude. clustering web search results snippets k-means agglomerative hierarchical clustering bisecting k-means swedish Human Computer Interaction

1

Page generated in 0.0471 seconds