Global ETD Search

1	Link discovery in very large graphs by constructive induction using genetic programming Weninger, Timothy Edwards January 1900 (has links) Master of Science / Department of Computing and Information Sciences / William H. Hsu / This thesis discusses the background and methodologies necessary for constructing features in order to discover hidden links in relational data. Specifically, we consider the problems of predicting, classifying and annotating friends relations in friends networks, based upon features constructed from network structure and user profile data. I first document a data model for the blog service LiveJournal, and define a set of machine learning problems such as predicting existing links and estimating inter-pair distance. Next, I explain how the problem of classifying a user pair in a social networks, as directly connected or not, poses the problem of selecting and constructing relevant features. In order to construct these features, a genetic programming approach is used to construct multiple symbol trees with base features as their leaves; in this manner, the genetic program selects and constructs features that many not have been considered, but possess better predictive properties than the base features. In order to extract certain graph features from the relatively large social network, a new shortest path search algorithm is presented which computes and operates on a Euclidean embedding of the network. Finally, I present classification results and discuss the properties of the frequently constructed features in order to gain insight on hidden relations that exists in this domain. Link Mining Graph Theory Machine Learning Evolutionary Computation Artificial Intelligence (0800) Computer Science (0984)
2	Feature Ranking for Text Classifiers Makrehchi, Masoud January 2007 (has links) Feature selection based on feature ranking has received much attention by researchers in the field of text classification. The major reasons are their scalability, ease of use, and fast computation. %, However, compared to the search-based feature selection methods such as wrappers and filters, they suffer from poor performance. This is linked to their major deficiencies, including: (i) feature ranking is problem-dependent; (ii) they ignore term dependencies, including redundancies and correlation; and (iii) they usually fail in unbalanced data. While using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from the function of feature ranking methods. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost their performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of text classifier is a monotonic, non-decreasing function of the number of features. Then we theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure to evaluate and compare feature ranking methods. The meta-ranking framework is also examined by a stopword extraction problem. We use the framework to select appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated by SVM and Rocchio text classifiers on six benchmark data. The meta-ranking method suggests that in searching for a proper feature ranking measure, the backward feature ranking is as important as the forward one. Second, we show that the destructive effect of term redundancy gets worse as we decrease the feature ranking threshold. It implies that for aggressive feature selection, an effective redundancy reduction should be performed as well as feature ranking. An algorithm based on extracting term dependency links using an information theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, including hub and link nodes, a heuristic algorithm is proposed to handle the term dependencies by merging or removing the link nodes. The proposed method of redundancy reduction is evaluated by SVM and Rocchio classifiers for four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers since they are more sensitive to term redundancies. It also suggests that in those feature ranking methods which compact the information in a small number of features, aggressive feature selection is not recommended. Finally, to deal with class imbalance in feature level using ranking methods, a local feature ranking scheme called reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method to improve the classifier performance when dealing with unbalanced data. The application itself suggests a new approach to learn social structures from textual data. Feature Ranking Feature Selection Information Theory Text Classifier Social Network Link Mining Electrical and Computer Engineering
3	Feature Ranking for Text Classifiers Makrehchi, Masoud January 2007 (has links) Feature selection based on feature ranking has received much attention by researchers in the field of text classification. The major reasons are their scalability, ease of use, and fast computation. %, However, compared to the search-based feature selection methods such as wrappers and filters, they suffer from poor performance. This is linked to their major deficiencies, including: (i) feature ranking is problem-dependent; (ii) they ignore term dependencies, including redundancies and correlation; and (iii) they usually fail in unbalanced data. While using feature ranking methods for dimensionality reduction, we should be aware of these drawbacks, which arise from the function of feature ranking methods. In this thesis, a set of solutions is proposed to handle the drawbacks of feature ranking and boost their performance. First, an evaluation framework called feature meta-ranking is proposed to evaluate ranking measures. The framework is based on a newly proposed Differential Filter Level Performance (DFLP) measure. It was proved that, in ideal cases, the performance of text classifier is a monotonic, non-decreasing function of the number of features. Then we theoretically and empirically validate the effectiveness of DFLP as a meta-ranking measure to evaluate and compare feature ranking methods. The meta-ranking framework is also examined by a stopword extraction problem. We use the framework to select appropriate feature ranking measure for building domain-specific stoplists. The proposed framework is evaluated by SVM and Rocchio text classifiers on six benchmark data. The meta-ranking method suggests that in searching for a proper feature ranking measure, the backward feature ranking is as important as the forward one. Second, we show that the destructive effect of term redundancy gets worse as we decrease the feature ranking threshold. It implies that for aggressive feature selection, an effective redundancy reduction should be performed as well as feature ranking. An algorithm based on extracting term dependency links using an information theoretic inclusion index is proposed to detect and handle term dependencies. The dependency links are visualized by a tree structure called a term dependency tree. By grouping the nodes of the tree into two categories, including hub and link nodes, a heuristic algorithm is proposed to handle the term dependencies by merging or removing the link nodes. The proposed method of redundancy reduction is evaluated by SVM and Rocchio classifiers for four benchmark data sets. According to the results, redundancy reduction is more effective on weak classifiers since they are more sensitive to term redundancies. It also suggests that in those feature ranking methods which compact the information in a small number of features, aggressive feature selection is not recommended. Finally, to deal with class imbalance in feature level using ranking methods, a local feature ranking scheme called reverse discrimination approach is proposed. The proposed method is applied to a highly unbalanced social network discovery problem. In this case study, the problem of learning a social network is translated into a text classification problem using newly proposed actor and relationship modeling. Since social networks are usually sparse structures, the corresponding text classifiers become highly unbalanced. Experimental assessment of the reverse discrimination approach validates the effectiveness of the local feature ranking method to improve the classifier performance when dealing with unbalanced data. The application itself suggests a new approach to learn social structures from textual data. Feature Ranking Feature Selection Information Theory Text Classifier Social Network Link Mining Electrical and Computer Engineering
4	Découverte des relations dans les réseaux sociaux / Relationship discovery in social networks Raad, Elie 22 December 2011 (has links) Les réseaux sociaux occupent une place de plus en plus importante dans notre vie quotidienne et représentent une part considérable des activités sur le web. Ce succès s’explique par la diversité des services/fonctionnalités de chaque site (partage des données souvent multimédias, tagging, blogging, suggestion de contacts, etc.) incitant les utilisateurs à s’inscrire sur différents sites et ainsi à créer plusieurs réseaux sociaux pour diverses raisons (professionnelle, privée, etc.). Cependant, les outils et les sites existants proposent des fonctionnalités limitées pour identifier et organiser les types de relations ne permettant pas de, entre autres, garantir la confidentialité des utilisateurs et fournir un partage plus fin des données. Particulièrement, aucun site actuel ne propose une solution permettant d’identifier automatiquement les types de relations en tenant compte de toutes les données personnelles et/ou celles publiées. Dans cette étude, nous proposons une nouvelle approche permettant d’identifier les types de relations à travers un ou plusieurs réseaux sociaux. Notre approche est basée sur un framework orientéutilisateur qui utilise plusieurs attributs du profil utilisateur (nom, age, adresse, photos, etc.). Pour cela, nous utilisons des règles qui s’appliquent à deux niveaux de granularité : 1) au sein d’un même réseau social pour déterminer les relations sociales (collègues, parents, amis, etc.) en exploitant principalement les caractéristiques des photos et leurs métadonnées, et, 2) à travers différents réseaux sociaux pour déterminer les utilisateurs co-référents (même personne sur plusieurs réseaux sociaux) en étant capable de considérer tous les attributs du profil auxquels des poids sont associés selon le profil de l’utilisateur et le contenu du réseau social. À chaque niveau de granularité, nous appliquons des règles de base et des règles dérivées pour identifier différents types de relations. Nous mettons en avant deux méthodologies distinctes pour générer les règles de base. Pour les relations sociales, les règles de base sont créées à partir d’un jeu de données de photos créées en utilisant le crowdsourcing. Pour les relations de co-référents, en utilisant tous les attributs, les règles de base sont générées à partir des paires de profils ayant des identifiants de mêmes valeurs. Quant aux règles dérivées, nous utilisons une technique de fouille de données qui prend en compte le contexte de chaque utilisateur en identifiant les règles de base fréquemment utilisées. Nous présentons notre prototype, intitulé RelTypeFinder, que nous avons implémenté afin de valider notre approche. Ce prototype permet de découvrir différents types de relations, générer des jeux de données synthétiques, collecter des données du web, et de générer les règles d’extraction. Nous décrivons les expériementations que nous avons menées sur des jeux de données réelles et syntéthiques. Les résultats montrent l’efficacité de notre approche à découvrir les types de relations. / In recent years, social network sites exploded in popularity and become an important part of the online activities on the web. This success is related to the various services/functionalities provided by each site (ranging from media sharing, tagging, blogging, and mainly to online social networking) pushing users to subscribe to several sites and consequently to create several social networks for different purposes and contexts (professional, private, etc.). Nevertheless, current tools and sites provide limited functionalities to organize and identify relationship types within and across social networks which is required in several scenarios such as enforcing users’ privacy, and enhancing targeted social content sharing, etc. Particularly, none of the existing social network sites provides a way to automatically identify relationship types while considering users’ personal information and published data. In this work, we propose a new approach to identify relationship types among users within either a single or several social networks. We provide a user-oriented framework able to consider several features and shared data available in user profiles (e.g., name, age, interests, photos, etc.). This framework is built on a rule-based approach that operates at two levels of granularity: 1) within a single social network to discover social relationships (i.e., colleagues, relatives, friends, etc.) by exploiting mainly photos’ features and their embedded metadata, and 2) across different social networks to discover co-referent relationships (same real-world persons) by considering all profiles’ attributes weighted by the user profile and social network contents. At each level of granularity, we generate a set of basic and derived rules that are both used to discover relationship types. To generate basic rules, we propose two distinct methodologies. On one hand, social relationship basic rules are generated from a photo dataset constructed using crowdsourcing. On the other hand, using all weighted attributes, co-referent relationship basic rules are generated from the available pairs of profiles having the same unique identifier(s) attribute(s) values. To generate the derived rules, we use a mining technique that takes into account the context of users, namely by identifying frequently used valid basic rules for each user. We present here our prototype, called RelTypeFinder, implemented to validate our approach. It allows to discover appropriately different relationship types, generate synthetic datesets, collect web data and photo, and generate mining rules. We also describe here the sets of experiments conducted on real-world and synthetic datasets. The evaluation results demonstrate the efficiency of the proposed relationship discovery approach. Pas de mot clé en français Relationship discovery Social networks Rule-based relationship identification Coreferent users Link mining Link type prediction Entity resolution Classification Crowdsourcing User profiles Photos Metadata 004.678

1

Page generated in 0.057 seconds