1 |
Massive data K-means clustering and bootstrapping via A-optimal Subsampling
Zhou, Dali 08 1900
Purdue University West Lafayette (PUWL) / Massive data analysis faces two computational bottlenecks. First, the data may be too large to store and read easily. Second, the computation time may be too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, but one drawback is that some correlations may be lost when the data is divided into chunks. Subsampling is another way to address both problems while taking correlation into consideration. Uniform sampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap approach uses uniform sampling and is computationally intensive, which becomes a serious challenge when the data size is massive. k-means clustering is a standard method in data analysis. It iterates to find centroids, which becomes difficult when the data size is massive. In this thesis, we propose optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators, that is, the sum of their component variances. This criterion is referred to as A-optimality in the literature. We show that the subsampling k-means centroids consistently approximate the full data centroids, and prove asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times. We also apply the subsampling approach to real data.
For the massive data bootstrap, we conducted a large simulation study in the framework of linear regression, based on the A-optimal theory proposed by Peng and Tan (2018). We focus on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths and running times. For both bootstrap and clustering, we compared A-optimal subsampling with uniform subsampling.
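The A-optimality criterion described in this abstract can be illustrated on a much simpler problem than the thesis treats. The sketch below is an assumption-laden toy, not the thesis's method: it estimates the full-data mean from r draws with replacement using inverse-probability weights, for which the trace of the estimator's variance is (1/r)[sum_i ||x_i||^2 / (n^2 p_i) - ||mean||^2]. For this particular estimator, probabilities proportional to ||x_i|| minimize that trace (by the Cauchy-Schwarz inequality). The function names are hypothetical.

```python
import math


def trace_variance(points, probs, r):
    """Trace of the variance of the inverse-probability-weighted mean
    estimator built from r draws with replacement (toy setting only)."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    # Second moment of a single weighted draw x_i / (n * p_i).
    second_moment = sum(
        sum(c * c for c in p) / (n * n * pi)
        for p, pi in zip(points, probs)
        if pi > 0  # zero-probability points are assumed to be zero vectors
    )
    return (second_moment - sum(m * m for m in mean)) / r


def uniform_probs(points):
    """Uniform subsampling: every point is equally likely."""
    n = len(points)
    return [1.0 / n] * n


def norm_proportional_probs(points):
    """Probabilities proportional to the Euclidean norm of each point,
    which minimize the trace above for this mean estimator."""
    norms = [math.sqrt(sum(c * c for c in p)) for p in points]
    total = sum(norms)
    if total == 0:  # all points are zero: fall back to uniform
        return uniform_probs(points)
    return [v / total for v in norms]


if __name__ == "__main__":
    data = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (0.5, 0.5)]
    print(trace_variance(data, uniform_probs(data), r=10))
    print(trace_variance(data, norm_proportional_probs(data), r=10))
```

The comparison printed at the end mirrors, in miniature, the uniform versus A-optimal comparison the abstract reports; the optimal distributions for k-means centroids and regression derived in the thesis are different and are not reproduced here.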
|
2 |
PREDICTING HYDRAULIC RESPONSE: COMPARISON OF TEXTURAL AND RESPONSE CLUSTERING APPROACHES TO SOIL CLASSIFICATION
Rice, Amy Katherine January 2009
Traditional soil classification methods invoke physical differences based on particle size to group soils into textural classes. Resulting groupings are used to make predictions about soil attributes and processes of interest including hydrologic response. My hypothesis is that more useful classification schemes will be created by starting with response and applying an inverse approach to generate soil groupings. I propose an alternative classification scheme based on these hypotheses, using techniques of cluster analysis. The resulting system has high predictive capacity with simplicity comparable to the U.S. Dept. of Agriculture soil textural triangle or other similar classification diagrams. I conclude that: classification is most appropriate when carried out on process and objective specific bases; there is a physical meaning to cluster-based groupings, which allows for more appropriate segregation of response as compared to textural groupings; using clusters, a small number of samples can be used to characterize the range of response.
|
3 |
Density and partition based clustering on massive threshold bounded data sets
Kannamareddy, Aruna Sai January 1900
Master of Science / Department of Computing and Information Sciences / William H. Hsu / This project explores the possibility of increasing the efficiency of clusters formed from massive data sets using the threshold blocking algorithm. Clusters formed this way are denser and of higher quality. Clusters formed by individual clustering algorithms alone do not necessarily eliminate outliers, and the resulting clusters can be complex or improperly distributed over the data set. The threshold blocking algorithm, from a current research paper by Michael Higgins of the Statistics Department, on the other hand performs better than existing algorithms at forming dense and distinctive units with a predefined threshold. Developing a hybridized algorithm that applies existing clustering algorithms to re-cluster the units thus formed is part of this project.
Clustering on the seeds formed by the threshold blocking algorithm eases the task for the existing algorithm by removing the overhead of handling outliers. The clusters thus generated are also more representative of the whole. Moreover, since the threshold blocking algorithm is proven to be fast and efficient, we can now make many more predictions from large data sets in less time. Predicting similar songs from the Million Song Data Set using such a hybridized algorithm is the evaluation task for this goal.
|
4 |
Small-Scale Dual Path Network for Image Classification and Machine Learning Applications to Color Quantization
Murrell, Ethan Davis 05 1900
This thesis consists of two projects in the field of machine learning. In the first, previous research in the OSCAR UNT lab based on KMeans color quantization is further developed and applied to individual color channels and segmented input images to explore compression rates while still maintaining high output image quality. The second project implements a small-scale dual path network for image classification using the CIFAR-10 dataset, which contains 60,000 32x32 pixel images across ten categories.
|
5 |
Traditional and Deep Learning Approaches to Color Image Compression and Pattern Recognition Problems
Jaques, Lorenzo E 08 1900
This thesis includes three separate research projects focusing on computer vision principles and deep learning pattern recognition problems. Chapter 3 entails color quantization applications using traditional Kmeans clustering techniques and random selection of color techniques within the red, green, blue (RGB) color space to maintain a high-quality image while significantly reducing image file size. Chapter 4 consists of a handwriting character recognition algorithm using backpropagation to classify 70,000 handwritten values from US Census Bureau employees and high school students. Chapter 5 proposes a novel classification technique for 109,446 unique heartbeat samples to identify areas of interest and assist medical professionals in diagnosing heart problems.
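As a rough illustration of K-means color quantization in RGB space, the sketch below clusters pixel colors and replaces each pixel with its cluster's centroid color, so the image needs only k palette entries. It is a minimal pure-Python sketch under stated assumptions, not the thesis's implementation: the function names are hypothetical, and the deterministic farthest-point seeding is chosen here only to make the example reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        updated = [
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def quantize(pixels, k):
    """Map every pixel to the rounded color of its nearest centroid."""
    centroids = kmeans(pixels, k)
    palette = [tuple(round(c) for c in centroid) for centroid in centroids]
    return [palette[nearest(p, centroids)] for p in pixels]


if __name__ == "__main__":
    reds_and_blues = [(250, 0, 0), (254, 4, 4), (0, 0, 250), (4, 4, 254)]
    print(quantize(reds_and_blues, k=2))
```

Storing one palette index per pixel instead of three color bytes is what reduces file size; the trade-off between k and visual quality is the kind of question the abstracts above describe exploring.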
|
6 |
Facebook社群人脈網絡與粉絲頁推薦之研究 / The Study of Recommendation on Social Connections and Fan Pages on Facebook
曾子洋, Tseng, Tzu Yang Unknown Date
Facebook is one of the most popular social websites in Taiwan, with over 13 million user accounts and a large amount of user data. A user's background and preferences can be inferred, to a certain extent, from their education, work experience, and liked fan pages. If this analyzed information is used to cluster users, for making friends or for directing them to fan pages they may like, potential customers can be reached and business opportunities seized.
The goal of this study is to build an online system that performs collaborative-filtering recommendation. Using personal data retrievable from Facebook (education, work experience, and liked fan pages), the system quantifies this information and clusters users with the Kmeans algorithm. In the current experiment, 4417 valid user profiles were clustered into 10 groups based on the proportions of liked fan page types (consolidated into 48 types in this study) together with work experience and education; these clusters serve as the basis for cross recommendation and further research. To show that the recommendation model is effective, the study uses an experimental group and a control group. The experimental group consists of the 10 fan pages recommended by this study, drawn from the fan page types whose proportion differs most between the user and the centroid of the user's cluster; the control group consists of 10 fan pages drawn from the types whose proportion differs most between the user and the overall population.
Finally, users rated their satisfaction with both sets of recommendations. A total of 68 pieces of user feedback were collected, and the average satisfaction scores of the experimental group and the control group were 0.5743 and 0.4268, respectively. A t-test at the 95% confidence level found sufficient evidence that the experimental group scores higher than the control group, which supports the benefit of this approach for recommendation accuracy and fulfills the goal of the study.
This experiment shows that fan page and social connection recommendation based on user attributes on Facebook is meaningful and valuable, and that real data can be used in social networking research. We hope these results encourage other social networking studies to analyze and validate with real data, so that their findings better reflect real users' behavior.
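The recommendation step in this abstract, picking the fan page types where a user differs most from the centroid of their cluster, can be sketched as follows. This is only an illustration under assumptions: the abstract does not state the direction of the difference, so the sketch assumes that types the cluster likes more than the user are the ones worth recommending. The function name and the category labels are hypothetical.

```python
def recommend_categories(user_share, centroid_share, top_n=10):
    """Rank fan page types by how much the cluster centroid's share
    exceeds the user's own share (assumed direction of the gap)."""
    gaps = {
        category: centroid_share[category] - user_share.get(category, 0.0)
        for category in centroid_share
    }
    ranked = sorted(gaps, key=lambda category: gaps[category], reverse=True)
    # Keep only types the cluster favors more than the user does.
    return [category for category in ranked if gaps[category] > 0][:top_n]


if __name__ == "__main__":
    user = {"music": 0.5, "sports": 0.1, "food": 0.4}
    centroid = {"music": 0.3, "sports": 0.4, "food": 0.2, "travel": 0.1}
    print(recommend_categories(user, centroid, top_n=2))
```

Swapping the cluster centroid for the population-wide proportions gives the analogue of the control group described above.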
|
7 |
Bevakning av sociala medier för marknadsanalys / Social Media Monitoring for Market Analysis
Forsare Källman, Povel, Lindblom, Robin January 2019
The aim of the study is to investigate the extent to which machine learning models can be used to identify market trends and replace current market analysis methods. Data is extracted from Swedish blog posts using Information Extraction and pre-processed with the TF-IDF standard. The data is then clustered with the kmeans algorithm. The results indicate some potential in monitoring social media, but further studies on implementing sentiment analysis and on developing the pre-processing methods further are required to achieve the goal.
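The TF-IDF pre-processing mentioned in this abstract can be sketched in a few lines. Several TF-IDF variants exist and the abstract does not say which one the study used; the sketch below assumes relative term frequency multiplied by log(N / document frequency), and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    posts = [["mode", "trend", "mode"], ["trend", "pris"], ["trend", "resa"]]
    for vector in tf_idf(posts):
        print(vector)
```

A term that appears in every post gets weight zero, so the vectors passed on to clustering emphasize the words that distinguish one post from another.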
|
8 |
An application of topic modeling algorithms to text analytics in business intelligence
Alsadhan, Majed January 1900
Master of Science / Department of Computing and Information Sciences / Doina Caragea / William H. Hsu / In this work, we focus on the task of clustering businesses in the state of Kansas based on the content of their websites and their business listing information. Our goal is to cluster the businesses and overcome the challenges facing current approaches, such as data noise, a low number of clustered businesses, and the lack of an evaluation approach. We propose an LSA-based approach that analyzes the businesses' data and clusters the businesses using the Bisecting K-Means algorithm. In this approach, we analyze the businesses' data using LSA and produce representations of the businesses in a reduced space. We then use these representations to cluster the businesses by applying the Bisecting K-Means algorithm. We also apply an existing LDA-based approach to cluster the businesses and compare the results with our proposed LSA-based approach. We evaluate the results using a human-expert-based evaluation procedure. Finally, we visualize the clusters produced in this work using Google Earth and Tableau.
According to our evaluation procedure, the LDA-based approach performed slightly better than the LSA-based approach. However, the LDA-based approach had some limitations: a low number of clustered businesses, and the inability to produce a hierarchical tree for the clusters. With the LSA-based approach, we were able to cluster all the businesses and produce a hierarchical tree for the clusters.
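Bisecting K-Means, the clustering step named in this abstract, repeatedly splits one cluster in two, which is why it naturally yields a hierarchical tree. The sketch below is a minimal pure-Python illustration and not the thesis's implementation: the LSA step is omitted, the points stand in for already-reduced vectors, the cluster to split is assumed to be the one with the largest sum of squared errors, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    """Component-wise mean of a non-empty cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)


def two_means(cluster, iterations=20):
    """Split one cluster in two with 2-means, seeded deterministically
    by the first point and the point farthest from it."""
    a = cluster[0]
    b = max(cluster, key=lambda p: sq_dist(p, a))
    left, right = list(cluster), []
    for _ in range(iterations):
        left = [p for p in cluster if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in cluster if sq_dist(p, a) > sq_dist(p, b)]
        if not right:  # no split possible (e.g. all points identical)
            break
        new_a, new_b = centroid(left), centroid(right)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return left, right


def bisecting_kmeans(points, k):
    """Split the worst cluster (largest SSE) until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        worst = max(candidates, key=sse)
        left, right = two_means(worst)
        if not right:
            break
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
    for group in bisecting_kmeans(pts, k=3):
        print(group)
```

Recording which parent each pair of children came from would give the hierarchical tree the abstract mentions; the sketch returns only the final leaves.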
|
9 |
Agrupamento em dois níveis para disseminação de mensagens em Redes Sociais Móveis tolerantes a atrasos e desconexões / Two-level clustering for message dissemination in delay- and disruption-tolerant Mobile Social Networks
Neves, Eric Vieira das 18 October 2018
Previous issue date: 2018-10-18 / Delay- and disruption-tolerant networks (DTNs) emerged as a solution for communication in scenarios where the Internet's basic premises for proper operation are not met: end-to-end connectivity, low latency and low packet loss. DTNs depend directly on the collaboration of their nodes for good performance, because they use node mobility to forward messages to their destinations. However, factors such as saving resources (energy, data storage), lack of interest in the message, or simply refusal to collaborate negatively affect network performance. It is therefore fundamental to consider social factors, extended from the users to the network nodes, in order to find the best forwarding strategy and increase the chances of delivering the messages. This work proposes a new protocol for message dissemination in DTNs that uses the interests of the network nodes as a social factor to form message-forwarding levels. The levels are formed with machine learning techniques, using clustering algorithms such as KMEANS and EM to group the nodes according to their level of interest, direct or indirect, in the generated message. The message is then forwarded through the groups formed until it reaches the destination node. Results obtained from a carefully selected set of experiments show that the proposal is promising, outperforming well-known protocols in the literature.
|
10 |
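Τεχνικές και μηχανισμοί συσταδοποίησης χρηστών και κειμένων για την προσωποποιημένη πρόσβαση περιεχομένου στον Παγκόσμιο Ιστό / Techniques and mechanisms for clustering users and texts for personalized content access on the World Wide Web
Τσόγκας, Βασίλειος 16 April 2015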
With the ever-increasing information sources on the internet, both in size and in indexed content, mechanisms are needed that help users get the information they need at the moment they need it. Delivering content personalized to user needs has become a necessity because of the combinatorial explosion of information visible in every corner of the World Wide Web. Swift and effective solutions are needed to deal with this information overload. Such solutions are achievable only through analysis of the problems involved and the application of modern mathematical and computational methods.
This Ph.D. dissertation aims at the design, development and evaluation of mechanisms and novel algorithms from the areas of information retrieval, natural language processing and machine learning, which provide a high level of filtering of internet information for end users. More precisely, for the various stages of information processing, techniques and mechanisms are developed that gather, index, filter and return textual content suited to users' tastes. These techniques and mechanisms aim to go beyond today's usual information delivery norms and to address, via novel means, several issues that are discussed.
The core of this Ph.D. dissertation is the development of a clustering mechanism that operates both on news articles and on users of the web. Within this context, several classical clustering algorithms were studied and evaluated for the case of news articles, allowing us to estimate how effective each one is in this domain. This left us with a clear choice as to which algorithm should be extended for our work.
In a second phase, we formulated a clustering algorithm that operates on news articles and user profiles and makes use of an external knowledge base, WordNet. This algorithm is adapted to the diversity and quick churn of news articles originating from the web.
Another central goal of this Ph.D. dissertation is the modeling of the browsing behavior of system users within the context of our recommendation system, as well as the automatic evaluation of these behaviors, with the desired outcome of predicting the future preferences of users. User modeling has direct application to the personalization of information through the prediction of user preferences. As a result, we formulated a personalization algorithm that takes into consideration a plethora of parameters that indirectly reveal user preferences.
|