1 |
Novel Measures on Directed Graphs and Applications to Large-Scale Within-Network Classification. Mantrach, Amin. 25 October 2010
In recent years, networks have become a major data source in fields ranging from the social sciences to the mathematical and physical sciences. The size of available networks has also grown substantially. This has brought a number of new challenges, such as the need for precise and intuitive measures to characterize and analyze large-scale networks in a reasonable time.
The first part of this thesis introduces a novel measure between two nodes of a weighted directed graph: the sum-over-paths covariance. It has a clear and intuitive interpretation: two nodes are considered highly correlated if they often co-occur on the same (preferably short) paths. This measure depends on a probability distribution over the countable (usually infinite) set of paths through the graph, obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter biases the probability distribution over a wide spectrum, from natural random walks (where all paths are equiprobable) to walks biased towards shortest paths. The measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques.
The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e., classification of nodes in partially labeled graphs. The algorithms have a computing time linear in the number of edges, classes and steps, and hence can be applied to large-scale networks. They obtained competitive results in comparison to state-of-the-art techniques on the large-scale U.S. patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S. patents citation network. This data set is now available to the community for benchmarking purposes.
The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show that citation-based data provide better results for classification than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, when classifying journal papers, extracting an external citation graph may considerably boost performance. However, in another context, when we have to directly classify the nodes of the citation network, the help of features on nodes will not improve the results.
The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields.
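The within-network classification setting described above (nodes of a partially labeled graph, cost linear in edges, classes and steps) can be illustrated with a minimal sketch of generic iterative label propagation. This is not one of the three algorithms of the thesis; the function name `propagate_labels`, the edge-list input format and the clamping of labeled nodes are assumptions made for this example only.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, n_classes, n_steps=10):
    """Generic label propagation (illustrative, not the thesis's method).

    edges: list of (u, v, w) directed weighted edges; u receives scores from v.
    seeds: dict mapping labeled nodes to their class index.
    Returns a dict mapping every node to a predicted class index.
    """
    nodes = set(seeds)
    for u, v, _ in edges:
        nodes.update((u, v))

    # One-hot scores for labeled nodes, all zeros for unlabeled nodes.
    scores = {n: [0.0] * n_classes for n in nodes}
    for n, c in seeds.items():
        scores[n][c] = 1.0

    for _ in range(n_steps):
        new = {n: [0.0] * n_classes for n in nodes}
        out_weight = defaultdict(float)
        # One pass over the edges: cost is edges x classes per step.
        for u, v, w in edges:
            out_weight[u] += w
            for c in range(n_classes):
                new[u][c] += w * scores[v][c]
        for n in nodes:
            if n in seeds:
                # Clamp labeled nodes to their known class.
                new[n] = [1.0 if c == seeds[n] else 0.0 for c in range(n_classes)]
            elif out_weight[n] > 0:
                new[n] = [s / out_weight[n] for s in new[n]]
        scores = new

    # Nodes that never receive any score fall back to class 0.
    return {n: max(range(n_classes), key=lambda c: scores[n][c]) for n in nodes}


example_edges = [("b", "a", 1.0), ("c", "d", 1.0), ("b", "c", 0.2), ("c", "b", 0.2)]
example_labels = propagate_labels(example_edges, {"a": 0, "d": 1}, n_classes=2, n_steps=5)
```

In the toy graph, `b` points mostly to the labeled node `a` and `c` mostly to the labeled node `d`, so each unlabeled node inherits the class of its dominant neighbor. The total work is proportional to edges × classes × steps, which is the kind of scaling the abstract refers to.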
|
2 |
Classificação semissupervisionada de séries temporais extraídas de imagens de satélite / Semi-supervised classification of time series extracted from satellite images. Amaral, Bruno Ferraz do. 29 April 2016
Over the last decades, the amount of digital data generated and stored, and with it the need to create and manage large databases, has increased significantly. The possibility of finding valid and potentially useful patterns and information in large databases has attracted the attention of many scientific areas. Time series databases have been explored using data mining methods in several application domains, such as economics, medicine and agrometeorology. Due to the large volume and complexity of some time series databases, labeling data for supervised tasks such as classification can be very expensive. To overcome the scarcity of labeled data, semi-supervised classification, which benefits from both labeled and unlabeled data, can be applied to classify data from large time series databases. In this Master's dissertation, we propose 1) a framework for the analysis of data extracted from satellite image time series (SITS) using data mining tasks and 2) a graph-based semi-supervised classification method, developed to classify temporal data obtained from satellite images. According to experts in agrometeorology, the proposed method and framework provide an automatic way of analyzing data extracted from SITS, which is very useful for supporting research in this application domain. We apply the framework and the proposed semi-supervised classification method to the analysis of vegetation index time series, aiming to identify sugarcane crop fields in Brazil. Experimental results, obtained with the support of domain experts, indicate that the proposed framework is useful for supporting research in agriculture. We also show that our method is more accurate than traditional supervised methods and related semi-supervised methods.
|
3 |
Classificação semi-supervisionada ativa baseada em múltiplas hierarquias de agrupamento / Active semi-supervised classification based on multiple clustering hierarchies. Batista, Antônio José de Lima. 08 August 2016
Active semi-supervised learning can play an important role in classification scenarios in which labeled data are laborious and/or expensive to obtain, while unlabeled data are numerous and easily acquired. There are many active algorithms in the literature; this work focuses on an active semi-supervised algorithm that can be driven by a clustering hierarchy, the well-known Hierarchical Sampling (HS) algorithm. This work takes the original Hierarchical Sampling algorithm as a starting point and changes different aspects of it in order to tackle its main drawbacks, including its sensitivity to the choice of a single particular hierarchy. Experimental results over many real datasets show that the proposed algorithm performs better than or competitively with a number of state-of-the-art algorithms for active semi-supervised classification.
|
4 |
Proposition d'une méthode spectrale combinée LDA et LLE pour la réduction non-linéaire de dimension : Application à la segmentation d'images couleurs / Proposition of a new spectral method combining LDA and LLE for non-linear dimension reduction: Application to color images segmentation. Hijazi, Hala. 19 December 2013
Data analysis and learning methods have developed considerably in recent years. After neural networks and kernel methods in the 1990s, spectral methods appeared in the 2000s. Spectral methods provide a unified mathematical framework for developing original classification methods. Among these techniques, two can be highlighted: LLE for non-linear dimension reduction and LDA for class discrimination. In this thesis, a new classification technique is proposed that combines the LLE and LDA methods. The method gave interesting results on synthetic data sets and provides efficient non-linear dimension reduction and discrimination. An extension of the method to semi-supervised learning is then proposed. The dimension reduction and discrimination properties of the method, together with the sparsity property of the LLE technique, make it possible to apply it successfully to color image segmentation. The semi-supervised version of the method leads to efficient segmentation of noisy color images. These results have to be extended and compared with other state-of-the-art methods. Nevertheless, interesting perspectives for future work are proposed in the conclusion.
|
7 |
Semi-Supervised Classification Using Gaussian Processes. Patel, Amrish. 01 1900
Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised classification tasks. In this thesis, we propose new algorithms for solving the semi-supervised binary classification problem using GP regression (GPR) models. The algorithms are closely related to semi-supervised classification based on support vector regression (SVR) and maximum margin clustering. The proposed algorithms are simple and easy to implement, and the hyper-parameters are estimated without resorting to expensive cross-validation. Unlike the SVR-based algorithm, the algorithm based on the sparse GPR model gives a sparse solution directly, which helps make it scalable. The results of experiments on synthetic and real-world datasets demonstrate the efficacy of the proposed sparse GP-based algorithm for semi-supervised classification.
|
8 |
Plug-in methods in classification / Méthodes de type plug-in en classification. Chzhen, Evgenii. 25 September 2019
This manuscript studies several problems of constrained classification. In this framework, our goal is to construct an algorithm that performs as well as the best classifier obeying some desired property. Plug-in type classifiers are well suited to this goal. Interestingly, it is shown that in several setups these classifiers can leverage unlabeled data, that is, they are constructed in a semi-supervised manner.
Chapter 2 describes two particular settings of binary classification: classification with the F-score and classification under equal opportunity. For both problems, semi-supervised procedures are proposed and their theoretical properties are established. In the case of the F-score, the proposed procedure is shown to be optimal in the minimax sense over a standard non-parametric class of distributions. In the case of classification under equal opportunity, the proposed algorithm is shown to be consistent in terms of the misclassification risk, and its asymptotic fairness is established. Moreover, for this problem, the proposed procedure outperforms state-of-the-art algorithms in the field.
Chapter 3 describes the setup of confidence set multi-class classification. Again, a semi-supervised procedure is proposed and its nearly minimax optimality is established. It is additionally shown that no supervised algorithm can achieve a so-called fast rate of convergence. In contrast, the proposed semi-supervised procedure can achieve fast rates provided that the size of the unlabeled data is sufficiently large.
Chapter 4 describes a setup of multi-label classification where one aims at minimizing the false negative error subject to almost-sure type constraints. In this part, two specific constraints are considered: sparse predictions and predictions with control over false negative errors. For the former, a supervised algorithm is provided and it is shown that this algorithm can achieve fast rates of convergence. For the latter, it is shown that extra assumptions are necessary in order to obtain theoretical guarantees.
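The plug-in idea for the F-score setting can be sketched as follows: estimate the conditional probability η(x) = P(Y = 1 | X = x) from labeled data, then threshold it, with the threshold computed from unlabeled points only. The sketch below assumes that the F1-optimal threshold θ solves θ·E[η(X)] = E[max(η(X) − θ, 0)], a standard characterization for F1 that may differ in detail from the procedure analyzed in the thesis. The names `f1_threshold` and `plug_in_classify`, and the list `probs` of estimated probabilities, are hypothetical.

```python
def f1_threshold(probs, tol=1e-9):
    """Threshold for an F1 plug-in rule, computed from unlabeled data only.

    probs: estimated probabilities P(Y = 1 | x) on unlabeled points
           (assumed to come from a model fitted on labeled data).
    Solves theta * mean(probs) = mean(max(p - theta, 0)) by bisection.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    mean_p = sum(probs) / len(probs)

    def gap(theta):
        # Left side minus right side of the fixed-point equation;
        # this function is non-decreasing in theta, so bisection applies.
        excess = sum(max(p - theta, 0.0) for p in probs) / len(probs)
        return theta * mean_p - excess

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def plug_in_classify(probs, theta):
    """Predict 1 where the estimated probability exceeds the threshold."""
    return [1 if p > theta else 0 for p in probs]


unlabeled_probs = [0.9, 0.1]
theta = f1_threshold(unlabeled_probs)
predictions = plug_in_classify(unlabeled_probs, theta)
```

The threshold depends only on the distribution of the estimated probabilities, not on any labels, which is why extra unlabeled data can help in this setting. For the two-point example, the equation reduces to θ = 0.9 − θ, so the threshold is 0.45 and only the first point is predicted positive.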
|
9 |
Novel measures on directed graphs and applications to large-scale within-network classification. Mantrach, Amin. 25 October 2010
Ces dernières années, les réseaux sont devenus une source importante d’informations dans différents domaines aussi variés que les sciences sociales, la physique ou les mathématiques. De plus, la taille de ces réseaux n’a cessé de grandir de manière conséquente. Ce constat a vu émerger de nouveaux défis, comme le besoin de mesures précises et intuitives pour caractériser et analyser ces réseaux de grandes tailles en un temps raisonnable.<p>La première partie de cette thèse introduit une nouvelle mesure de similarité entre deux noeuds d’un réseau dirigé et pondéré :la covariance “sum-over-paths”. Celle-ci a une interprétation claire et précise :en dénombrant tous les chemins possibles deux noeuds sont considérés comme fortement corrélés s’ils apparaissent souvent sur un même chemin – de préférence court. Cette mesure dépend d’une distribution de probabilités, définie sur l’ensemble infini dénombrable des chemins dans le graphe, obtenue en minimisant l'espérance du coût total entre toutes les paires de noeuds du graphe sachant que l'entropie relative totale injectée dans le réseau est fixée à priori. Le paramètre d’entropie permet de biaiser la distribution de probabilité sur un large spectre :allant de marches aléatoires naturelles où tous les chemins sont équiprobables à des marches biaisées en faveur des plus courts chemins. Cette mesure est alors appliquée à des problèmes de classification semi-supervisée sur des réseaux de taille moyennes et comparée à l’état de l’art.<p>La seconde partie de la thèse introduit trois nouveaux algorithmes de classification de noeuds en sein d’un large réseau dont les noeuds sont partiellement étiquetés. Ces algorithmes ont un temps de calcul linéaire en le nombre de noeuds, de classes et d’itérations, et peuvent dés lors être appliqués sur de larges réseaux. Ceux-ci ont obtenus des résultats compétitifs en comparaison à l’état de l’art sur le large réseaux de citations de brevets américains et sur huit autres jeux de données. 
De plus, durant la thèse, nous avons collecté un nouveau jeu de données, déjà mentionné :le réseau de citations de brevets américains. Ce jeu de données est maintenant disponible pour la communauté pour la réalisation de tests comparatifs.<p>La partie finale de cette thèse concerne la combinaison d’un graphe de citations avec les informations présentes sur ses noeuds. De manière empirique, nous avons montré que des données basées sur des citations fournissent de meilleurs résultats de classification que des données basées sur des contenus textuels. Toujours de manière empirique, nous avons également montré que combiner les différentes sources d’informations (contenu et citations) doit être considéré lors d’une tâche de classification de textes. Par exemple, lorsqu’il s’agit de catégoriser des articles de revues, s’aider d’un graphe de citations extrait au préalable peut améliorer considérablement les performances. Par contre, dans un autre contexte, quand il s’agit de directement classer les noeuds du réseau de citations, s’aider des informations présentes sur les noeuds n’améliora pas nécessairement les performances.<p>La théorie, les algorithmes et les applications présentés dans cette thèse fournissent des perspectives intéressantes dans différents domaines.<p><p><p>In recent years, networks have become a major data source in various fields ranging from social sciences to mathematical and physical sciences. Moreover, the size of available networks has grow substantially as well. This has brought with it a number of new challenges, like the need for precise and intuitive measures to characterize and analyze large scale networks in a reasonable time. <p>The first part of this thesis introduces a novel measure between two nodes of a weighted directed graph: The sum-over-paths covariance. It has a clear and intuitive interpretation: two nodes are considered as highly correlated if they often co-occur on the same -- preferably short -- paths. 
This measure depends on a probability distribution over the (usually infinite) countable set of paths through the graph which is obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter allows to bias the probability distribution over a wide spectrum: going from natural random walks (where all paths are equiprobable) to walks biased towards shortest-paths. This measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques.<p>The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e. classification of nodes in partially labeled graphs. The algorithms have a linear computing time in the number of edges, classes and steps and hence can be applied to large scale networks. They obtained competitive results in comparison to state-of-the-art technics on the large scale U.S.~patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S.~patents citation network. This data set is now available to the community for benchmarks purposes. <p>The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show that citation-based data provide better results for classification than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, while classifying journal papers, considering to extract an external citation graph may considerably boost the performance. 
However, in another context, when we have to classify the nodes of the citation network directly, the features on the nodes will not necessarily improve the results.

The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished
|
10 |
"The Trees Act Not as Individuals"--Learning to See the Whole Picture in Biology Education and Remote Sensing Research
Greenall, Rebeka A.F. 18 August 2023
To increase equity and inclusion for underserved and excluded Indigenous students, we must make efforts to mitigate the unique barriers they face. As their knowledge systems have been historically excluded and erased in Western science, we begin by reviewing the literature on the inclusion of Traditional Ecological Knowledge (TEK) in biology education and describe best practices. Next, to better understand how Native Hawaiian and other Pacific Islander (NHPI) students integrate into the scientific community, we used Social Influence Theory as a framework to measure NHPI student science identity, self-efficacy, alignment with science values, and belonging. We also investigated how students feel their ethnic and science identities interact. We found that NHPI students do not significantly differ from non-NHPI students in these measures of integration, and that NHPI students vary in how they perceive the interaction between their ethnic and science identities. Some students experience conflict between the two identities, while others view the two as strengthening each other. Next, we describe a lesson plan created to include Hawaiian TEK in a biology class using best practices described in the literature. This is followed by an empirical study of how students were impacted by this lesson. We measured student integration into the science community using science identity, self-efficacy, alignment with science values, and belonging. We found no significant differences between NHPI and non-NHPI students. We also looked at student participation, and found that all students participated more on intervention days involving TEK and other ways of knowing than on non-intervention days. Finally, we describe qualitative findings on how students were impacted by the TEK interventions. We found that students were predominantly positively impacted by the inclusion of TEK, and we discuss future adjustments that could be made using their recommendations.
The last chapter describes how we used remote sensing to investigate land cover in a fenced and an unfenced region of the Koʻolau Mountains on the island of Oahu. After mapping the biodiversity hotspot Management Unit of Koloa, we found that there is slightly more bare ground, grass, and bare ground/low vegetation mix in fenced, and thereby ungulate-free, areas than in unfenced areas with ungulates. Implications of these findings and suggestions for future research are discussed.
|