Global ETD Search

61	A distributed kernel summation framework for machine learning and scientific applications Lee, Dong Ryeol 11 May 2012 The class of computational problems I consider in this thesis shares the common trait of requiring consideration of pairs (or higher-order tuples) of data points. I focus on kernel summation operations, which are ubiquitous in many data mining and scientific algorithms. In machine learning, kernel summations appear in popular kernel methods which can model nonlinear structures in data. Kernel methods include many non-parametric methods such as kernel density estimation, kernel regression, Gaussian process regression, kernel PCA, and kernel support vector machines (SVM). In computational physics, kernel summations occur inside the classical N-body problem for simulating positions of a set of celestial bodies or atoms. This thesis attempts to marry, for the first time, the best relevant techniques in parallel computing, where kernel summations are in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide a unified, efficient parallel kernel summation framework that can utilize: (1) various types of deterministic and probabilistic approximations that may be suitable for both low- and high-dimensional problems with a large number of data points; (2) indexing of the data using any multi-dimensional binary tree with both distributed memory (MPI) and shared memory (OpenMP/Intel TBB) parallelism; (3) a dynamic load balancing scheme to adjust work imbalances during the computation. I will first summarize my previous research in serial kernel summation algorithms. This work started from Greengard/Rokhlin's earlier work on fast multipole methods for the purpose of approximating potential sums of many particles. 
The contributions of this part of the thesis include the following: (1) a reinterpretation of Greengard/Rokhlin's work for the computer science community; (2) the extension of the algorithms to use a larger class of approximation strategies, i.e. probabilistic error bounds via Monte Carlo techniques; (3) the multibody series expansion: the generalization of the theory of fast multipole methods to handle interactions of more than two entities; (4) the first O(N) proof of the batch approximate kernel summation using a notion of intrinsic dimensionality. Then I move on to the problem of parallelizing kernel summations and tackle the scaling of two other kernel methods, Gaussian process regression (kernel matrix inversion) and kernel PCA (kernel matrix eigendecomposition). The artifact of this thesis has contributed to an open-source machine learning package called MLPACK, which was first demonstrated at NIPS 2008 and subsequently at the NIPS 2011 Big Learning Workshop. Completing a portion of this thesis involved the use of high-performance computing resources at XSEDE (eXtreme Science and Engineering Discovery Environment) and NERSC (National Energy Research Scientific Computing Center). Parallel multitree methods Fast Gauss transforms Fast multipole methods Parallel machine learning Parallel kernel methods Multidimensional trees Kernel functions Machine learning Algorithms
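For orientation, the kernel summation at the centre of entry 61 can be written as an exhaustive double loop. The sketch below is only that quadratic baseline, which tree-based and multipole-style methods aim to approximate faster; it is not the thesis's framework or MLPACK code. The Gaussian kernel, the function names and the sample points are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 h^2)) for two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_sums(queries, references, bandwidth):
    """Exhaustive kernel summation: for each query, sum the kernel over all references."""
    return [
        sum(gaussian_kernel(q, r, bandwidth) for r in references)
        for q in queries
    ]


# Illustrative data (hypothetical): three reference points, two query points.
references = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(0.0, 0.0), (10.0, 10.0)]
sums = kernel_sums(queries, references, bandwidth=1.0)
```

The cost is proportional to the number of queries times the number of references, which is what motivates approximate methods for large data sets.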
62	Evaluation of a neural network for formulating a semi-empirical variable kernel BRDF model Manoharan, Madhu, January 2005 Thesis (M.S.) -- Mississippi State University. Department of Electrical and Computer Engineering. / Title from title screen. Includes bibliographical references.
63	Kernel methods for flight data monitoring / Méthodes à noyau pour l'analyse de données de vols appliquées aux opérations aériennes Chrysanthos, Nicolas 24 October 2014 Flight Data Monitoring (FDM) is the process by which an airline routinely collects, processes, and analyses the data recorded in aircraft with the goal of improving overall safety or operational efficiency. The goal of this thesis is to investigate machine learning methods, and in particular kernel methods, for the detection of atypical flights that may present problems that cannot be found using traditional methods. Atypical flights may present safety or operational issues and thus need to be studied by an FDM expert. In the first part we propose a novel method for anomaly detection that is suited to the constraints of the field of FDM. We rely on a novel dimensionality reduction technique called kernel entropy component analysis to design a method which is both unsupervised and robust. In the second part we address the most salient issue in the field of FDM, which is how the data is structured. First, we extend the method to take into account parameters of diverse types, such as continuous, discrete or angular. Second, we explore techniques to take into account the temporal aspect of flights and propose a new kernel in the family of dynamic time warping techniques, and demonstrate that it is faster to compute than competing techniques and is positive definite. We illustrate our approach with promising results on real-world datasets from the airlines TAP and Transavia comprising several hundred flights. Kernel functions Discriminant analysis Information theory Data structures (Computer science) Aeronautics -- Safety measures 629.13
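As background for the dynamic time warping family mentioned in entry 63, the classical DTW cost between two numeric sequences can be sketched as follows. This is the textbook dynamic-programming recurrence, not the new kernel proposed in the thesis; the function name and the absolute-difference local cost are assumptions made for the example.

```python
def dtw_distance(a, b):
    """Classical dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Two series that differ only by a repeated sample align at zero cost.
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

The table has (n + 1) by (m + 1) cells, so the cost grows with the product of the two sequence lengths.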
64	Infinitely Divisible Metrics, Curvature Inequalities And Curvature Formulae Keshari, Dinesh Kumar 07 1900 The curvature of a contraction $T$ in the Cowen-Douglas class is bounded above by the curvature of the backward shift operator. However, in general, an operator satisfying the curvature inequality need not be contractive. In this thesis, we characterize a slightly smaller class of contractions using a stronger form of the curvature inequality. Along the way, we find conditions on the metric of the holomorphic Hermitian vector bundle $E$ corresponding to the operator $T$ in the Cowen-Douglas class which ensure negative definiteness of the curvature function. We obtain a generalization for commuting tuples of operators in the Cowen-Douglas class. Secondly, we obtain an explicit formula for the curvature of the jet bundle of the Hermitian holomorphic bundle $E_f$ on a planar domain $\Omega$. Here $E_f$ is assumed to be a pull-back of the tautological bundle on $\mathrm{Gr}(n, H)$ by a nondegenerate holomorphic map $f : \Omega \to \mathrm{Gr}(n, H)$. Clearly, finding relationships among the complex geometric invariants inherent in the short exact sequence $0 \to J_k(E_f) \to J_{k+1}(E_f) \to J_{k+1}(E_f)/J_k(E_f) \to 0$ is an important problem, where $J_k(E_f)$ represents the $k$-th order jet bundle. It is known that the Chern classes of these bundles must satisfy $c(J_{k+1}(E_f)) = c(J_k(E_f))\, c(J_{k+1}(E_f)/J_k(E_f))$. We obtain a refinement of this formula: $\operatorname{trace}\,\mathrm{Id}_{n \times n}(K_{J_k(E_f)}) - \operatorname{trace}\,\mathrm{Id}_{n \times n}(K_{J_{k-1}(E_f)}) = K_{J_k(E_f)/J_{k-1}(E_f)}(z)$. Hilbert Space Curvature Inequalities Cowen-Douglas Class Of Operators Curvature of a Contraction Jet Bundles (Mathematics) Vector Bundles Hermitian Holomorphic Vector Bundle Infinitely Divisible Metrics Kernel Functions Geometry
65	The asymptotic stability of stochastic kernel operators Brown, Thomas John 06 1900 A stochastic operator is a positive linear contraction $P : L^1 \to L^1$ such that $\|Pf\|_1 = \|f\|_1$ for $f \ge 0$. It is called asymptotically stable if the iterates $P^n f$ of each density converge in norm to a fixed density. $Pf(x) = \int K(x,y) f(y)\,dy$, where $K(\cdot, y)$ is a density, defines a stochastic kernel operator. A general probabilistic/deterministic model for biological systems is considered. This leads to the LMT operator $Pf(x) = \int_0^{\lambda(x)} -\frac{\partial}{\partial x} H(Q(\lambda(x)) - Q(y))\, f(y)\,dy$, where $-H'(x) = h(x)$ is a density. Several particular examples of cell cycle models are examined. An operator overlaps supports if, for all densities $f, g$, $P^n f \wedge P^n g \neq 0$ for some $n$. If the operator is partially kernel, has a positive invariant density and overlaps supports, it is asymptotically stable. It is found that if $h(x) > 0$ for $x \ge x_0 \ge 0$ and $\int_0^\infty x^\alpha h(x)\,dx < \liminf_{x \to \infty} \left( Q(\lambda(x))^\alpha - Q(x)^\alpha \right)$ for $\alpha \in (0, 1]$, then $P$ is asymptotically stable, and an opposite condition implies $P$ is sweeping. Many known results for cell cycle models follow from this. / Mathematical Science / M. Sc. (Mathematics) Markov operator Stochastic operator Asymptotic stability Ergodic theory Biological models Cell cycle models Kernel operations Doubly stochastic operators Harris operators Stochastic process 515.7246 Kernel functions Operator equations -- Asymptotic theory Ergodic theory Cell cycle Stochastic processes Random operators
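A finite-state analogue may help make the definitions in entry 65 concrete. In the sketch below the integral is replaced by a sum and the kernel by a small matrix whose columns each sum to 1, mirroring the requirement that K(., y) be a density. The matrix entries and function names are invented for illustration and are unrelated to the cell cycle models studied in the thesis.

```python
def apply_operator(kernel, density):
    """Discrete analogue of Pf(x) = integral of K(x, y) f(y) dy on a finite state space."""
    n = len(density)
    return [sum(kernel[i][j] * density[j] for j in range(n)) for i in range(n)]


def iterate(kernel, density, steps):
    """Return the iterate P^steps applied to the given density."""
    for _ in range(steps):
        density = apply_operator(kernel, density)
    return density


# Hypothetical kernel: every column sums to 1 and every entry is positive.
kernel = [
    [0.5, 0.2, 0.1],
    [0.3, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

# Two very different starting densities.
f = iterate(kernel, [1.0, 0.0, 0.0], 200)
g = iterate(kernel, [0.0, 0.0, 1.0], 200)

# L1 distance between the two iterates; it shrinks towards 0 as steps grow.
l1_gap = sum(abs(a - b) for a, b in zip(f, g))
```

Both starting densities keep total mass 1 under iteration and approach the same limiting density, which is the finite-dimensional picture of asymptotic stability.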
66	Hydrologic Impacts Of Climate Change : Uncertainty Modeling Ghosh, Subimal 07 1900 General Circulation Models (GCMs) are tools designed to simulate time series of climate variables globally, accounting for effects of greenhouse gases in the atmosphere. They attempt to represent the physical processes in the atmosphere, ocean, cryosphere and land surface. They are currently the most credible tools available for simulating the response of the global climate system to increasing greenhouse gas concentrations, and for providing estimates of climate variables (e.g. air temperature, precipitation, wind speed, pressure etc.) on a global scale. GCMs demonstrate significant skill at the continental and hemispheric spatial scales and incorporate a large proportion of the complexity of the global system; they are, however, inherently unable to represent local subgrid-scale features and dynamics. The spatial scale on which a GCM can operate (e.g., 3.75° longitude x 3.75° latitude for the Coupled Global Climate Model, CGCM2) is very coarse compared to that of a hydrologic process (e.g., precipitation in a region, streamflow in a river etc.) of interest in climate change impact assessment studies. Moreover, the accuracy of GCMs, in general, decreases from climate-related variables, such as wind, temperature, humidity and air pressure, to hydrologic variables such as precipitation, evapotranspiration, runoff and soil moisture, which are also simulated by GCMs. These limitations of the GCMs restrict the direct use of their output in hydrology. This thesis deals with developing statistical downscaling models to assess climate change impacts and methodologies to address GCM and scenario uncertainties in assessing climate change impacts on hydrology. 
Downscaling, in the context of hydrology, is a method to project the hydrologic variables (e.g., rainfall and streamflow) at a smaller scale based on large-scale climatological variables (e.g., mean sea level pressure) simulated by a GCM. A statistical downscaling model is first developed in the thesis to predict the rainfall over Orissa meteorological subdivision from GCM output of large-scale Mean Sea Level Pressure (MSLP). Gridded monthly MSLP data for the period 1948 to 2002 are obtained from the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis project for a region spanning 15° N to 25° N in latitude and 80° E to 90° E in longitude that encapsulates the study region. The downscaling model comprises Principal Component Analysis (PCA), Fuzzy Clustering and Linear Regression. PCA is carried out to reduce the dimensionality of the larger-scale MSLP and also to convert the correlated variables to uncorrelated variables. Fuzzy clustering is performed to derive the membership of the principal components in each of the clusters, and the memberships obtained are used in regression to statistically relate MSLP and rainfall. The statistical relationship thus obtained is used to predict the rainfall from GCM output. The rainfall predicted with the GCM developed by CCSR/NIES with the B2 scenario presents a decreasing trend for the non-monsoon period, for the case study. Climate change impact assessment models developed based on downscaled GCM output are subjected to a range of uncertainties due to both ‘incomplete knowledge’ and ‘unknowable future scenario’ (New and Hulme, 2000). ‘Incomplete knowledge’ mainly arises from inadequate information and understanding about the underlying geophysical process of global change, leading to limitations in the accuracy of GCMs. This is also termed GCM uncertainty. 
Uncertainty due to ‘unknowable future scenario’ is associated with the unpredictability in the forecast of socio-economic and human behavior resulting in future Green House Gas (GHG) emission scenarios, and can also be termed scenario uncertainty. Downscaled outputs of a single GCM with a single climate change scenario represent a single trajectory among a number of realizations derived using various GCMs and scenarios. Such a single trajectory alone cannot represent a future hydrologic scenario, and will not be useful in assessing hydrologic impacts due to climate change. Nonparametric methods are developed in the thesis to model GCM and scenario uncertainty for prediction of drought scenarios, with Orissa meteorological subdivision as a case study. Using the downscaling technique described in the previous paragraph, future rainfall scenarios are obtained for all available GCMs and scenarios. After correcting for bias, equiprobability transformation is used to convert the precipitation into Standardized Precipitation Index-12 (SPI-12), an annual drought indicator, based on which a drought may be classified as a severe drought, mild drought etc. Disagreements are observed between different predictions of SPI-12, resulting from different GCMs and scenarios. Assuming SPI-12 to be a random variable at every time step, nonparametric methods based on kernel density estimation and orthonormal series are used to determine the nonparametric probability density function (pdf) of SPI-12. Probabilities for different categories of drought are computed from the estimated pdf. It is observed that there is an increasing trend in the probability of extreme drought and a decreasing trend in the probability of near normal conditions, in the Orissa meteorological subdivision. 
The single-valued Cumulative Distribution Functions (CDFs) obtained from nonparametric methods suffer from limitations due to the following: (a) simulations for all scenarios are not available for all the GCMs, thus leading to a possibility that incorporation of these missing climate experiments may result in a different CDF, (b) the method may simply overfit to a multimodal distribution from a relatively small sample of GCMs with a limited number of scenarios, and (c) the set of all scenarios may not fully compose the universal sample space, and thus, the precise single-valued probability distribution may not be representative enough for applications. To overcome these limitations, an interval regression is performed to fit an imprecise normal distribution to the SPI-12 to provide a band of CDFs instead of a single-valued CDF. Such a band of CDFs represents the incomplete nature of knowledge, thus reflecting the extent of what is ignored in the climate change impact assessment. From imprecise CDFs, the imprecise probabilities of different categories of drought are computed. These results also show an increasing trend of the bounds of the probability of extreme drought and a decreasing trend of the bounds of the probability of near normal conditions, in the Orissa meteorological subdivision. Water resources planning requires information about future streamflow scenarios in a river basin to combat hydrologic extremes resulting from climate change. It is therefore necessary to downscale GCM projections for streamflow prediction at river basin scales. A statistical downscaling model based on PCA, fuzzy clustering and Relevance Vector Machine (RVM) is developed to predict the monsoon streamflow of the Mahanadi river at Hirakud reservoir, from GCM projections of large-scale climatological data. 
Surface air temperature at 2 m, Mean Sea Level Pressure (MSLP), geopotential height at a pressure level of 500 hectopascals (hPa) and surface specific humidity are considered as the predictors for modeling Mahanadi streamflow in the monsoon season. PCA is used to reduce the dimensionality of the predictor dataset and also to convert the correlated variables to uncorrelated variables. Fuzzy clustering is carried out to derive the membership of the principal components in each of the clusters, and the memberships thus obtained are used in the RVM regression model. RVM involves fewer relevance vectors, and its chance of overfitting is lower than that of the Support Vector Machine (SVM). Different kernel functions are compared, and it is concluded that the heavy-tailed Radial Basis Function (RBF) performs best for streamflow prediction with GCM output for the case considered. The GCM CCSR/NIES with the B2 scenario projects a decreasing trend in future monsoon streamflow of Mahanadi, which is likely to be due to high surface warming. A possibilistic approach is developed next, for modeling GCM and scenario uncertainty in projection of monsoon streamflow of the Mahanadi river. Three GCMs, Center for Climate System Research/National Institute for Environmental Studies (CCSR/NIES), Hadley Climate Model 3 (HadCM3) and Coupled Global Climate Model 2 (CGCM2), with two scenarios, A2 and B2, are used for the purpose. Possibilities are assigned to GCMs and scenarios based on their system performance measure in predicting the streamflow during the years 1991-2005, when signals of climate forcing are visible. The possibilities are used as weights for deriving the possibilistic mean CDF for the three standard time slices, 2020s, 2050s and 2080s. 
It is observed that the value of streamflow at which the possibilistic mean CDF reaches the value of 1 reduces with time, which shows a reduction in the probability of occurrence of extreme high flow events in future, and therefore there is likely to be a decreasing trend in the monthly peak flow. One possible reason for such a decreasing trend may be the significant increase in temperature due to climate warming. Simultaneous occurrence of a reduction in Mahanadi streamflow and an increase in extreme drought in Orissa meteorological subdivision is likely to pose a challenge for water resources engineers in meeting water demands in future. Climate Change Microclimatology General Circulation Models (GCMs) Climate - Circulation Model Climate Change - Statistical Methods Climate Impact Assessment Climate Model Atmospheric Circulation Model Modeling GCM Fuzzy Clustering Streamflow Prediction Kernel Functions Vector Machine Climatology
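The kernel density estimation step described in entry 66 (treating SPI-12 at one time step as a random variable sampled across GCMs and scenarios, then reading category probabilities from the estimated distribution) can be illustrated with a minimal sketch. Everything numeric here is an assumption: the SPI-12 values are invented, the bandwidth is fixed by hand, and the category thresholds (at or below -2.0 for extreme drought, between -1.0 and 1.0 for near normal) follow the commonly used SPI classification rather than anything stated in the abstract.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_cdf(samples, bandwidth, t):
    """CDF at t of a Gaussian kernel density estimate built on the samples."""
    return sum(normal_cdf((t - s) / bandwidth) for s in samples) / len(samples)


def interval_probability(samples, bandwidth, low, high):
    """Probability mass that the kernel density estimate assigns to (low, high]."""
    return kde_cdf(samples, bandwidth, high) - kde_cdf(samples, bandwidth, low)


# Hypothetical SPI-12 values for one year, one per GCM/scenario combination.
spi_ensemble = [-2.3, -1.7, -1.1, -0.4, 0.2, 0.6]
bandwidth = 0.5  # assumed; in practice chosen by a bandwidth selection rule

p_extreme = kde_cdf(spi_ensemble, bandwidth, -2.0)
p_near_normal = interval_probability(spi_ensemble, bandwidth, -1.0, 1.0)
```

Repeating the calculation for each year would give the kind of time series of category probabilities whose trends the abstract reports.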
67	Tuned and asynchronous stencil kernels for CPU/GPU systems Venkatasubramanian, Sundaresan 18 May 2009 We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060. Motivated to find a still faster implementation, we further consider "wildly asynchronous" implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly "fast-and-loose" algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs. Hybrid High performance computing Architecture Chaotic relaxation Tesla Linear system of equations Numerical methods Occupancy Algorithms Experimentation Performance Scientific computing Gauss-Seidel Shared memory Coalesced memory Bank conflicts GPU CUDA Nvidia Heterogeneous CPU Iterative methods (Mathematics) Kernel functions
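For readers unfamiliar with the method named in entry 67, a serial, fully synchronized Jacobi sweep for the 2-D Poisson problem on a square grid looks roughly as follows. This is a plain CPU sketch of the standard five-point update, not the tuned or asynchronous GPU implementations the thesis describes; the function name, grid size, boundary value and iteration count are illustrative assumptions.

```python
def jacobi_poisson(f, boundary_value, h, iterations):
    """Jacobi sweeps for -laplace(u) = f on an n x n grid with a constant boundary value."""
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        u[0][k] = u[n - 1][k] = u[k][0] = u[k][n - 1] = boundary_value
    for _ in range(iterations):
        # Synchronized update: every new value is computed from the previous sweep only.
        new_u = [row[:] for row in u]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new_u[i][j] = 0.25 * (
                    u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    + h * h * f[i][j]
                )
        u = new_u
    return u


# With a zero source term the solution is the constant boundary value.
n = 8
h = 1.0 / (n - 1)
zero_source = [[0.0] * n for _ in range(n)]
u = jacobi_poisson(zero_source, boundary_value=1.0, h=h, iterations=500)
```

The copy into `new_u` on every sweep is the synchronization point that the relaxed variants in the abstract remove or delay, at the price of possibly needing more iterations to converge.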
69	Modelo de predição para análise comparativa de técnicas Neuro-Fuzzy e de Regressão [Prediction model for the comparative analysis of neuro-fuzzy and regression techniques] Oliveira, Alessandro Bertolani 12 February 2010 We investigate strategies for defining prediction models that can be used to monitor a quality variable of an industrial production process. The variable is estimated using computational intelligence techniques, in particular regression methods, and evaluated empirically. The main contribution of this work is a comparative analysis of computational intelligence techniques combined with heuristic training strategies for building the prediction models. Two lines of research are pursued, drawn from two major branches of computational intelligence: machine learning and hybrid artificial neural networks. The resulting prediction models are conceptual prototypes for the potential real-time intelligent supervision of an industrial plant. Statistical tools are used to compare the regression-based predictor with the neuro-fuzzy predictor, considering how well each model fits the problem and how well it generalizes. Finally, results of the implemented methods on the same dataset are presented, along with relevant future work. Regression analysis Kernel functions Computational intelligence Neural networks (Computer science) Prediction theory Stochastic processes
70	Novel measures on directed graphs and applications to large-scale within-network classification Mantrach, Amin 25 October 2010 In recent years, networks have become a major data source in various fields ranging from social sciences to mathematical and physical sciences. Moreover, the size of available networks has grown substantially as well. This has brought with it a number of new challenges, like the need for precise and intuitive measures to characterize and analyze large-scale networks in a reasonable time. The first part of this thesis introduces a novel measure between two nodes of a weighted directed graph: the sum-over-paths covariance. It has a clear and intuitive interpretation: two nodes are considered highly correlated if they often co-occur on the same (preferably short) paths. This measure depends on a probability distribution over the (usually infinite) countable set of paths through the graph, which is obtained by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. The entropy parameter allows the probability distribution to be biased over a wide spectrum, going from natural random walks (where all paths are equiprobable) to walks biased towards shortest paths. This measure is then applied to semi-supervised classification problems on medium-size networks and compared to state-of-the-art techniques. The second part introduces three novel algorithms for within-network classification in large-scale networks, i.e. classification of nodes in partially labeled graphs. The algorithms have a computing time that is linear in the number of edges, classes and steps, and hence can be applied to large-scale networks. They obtained competitive results in comparison to state-of-the-art techniques on the large-scale U.S. patents citation network and on eight other data sets. Furthermore, during the thesis, we collected a novel benchmark data set: the U.S. patents citation network. This data set is now available to the community for benchmarking purposes. The final part of the thesis concerns the combination of a citation graph with information on its nodes. We show empirically that citation-based data provide better classification results than content-based data. We also show empirically that combining both sources of information (content-based and citation-based) should be considered when facing a text categorization problem. For instance, when classifying journal papers, extracting an external citation graph beforehand may considerably boost the performance. However, in another context, when we have to directly classify the nodes of the citation network, the help of features on nodes will not necessarily improve the results. The theory, algorithms and applications presented in this thesis provide interesting perspectives in various fields. / Doctorate in Sciences / info:eu-repo/semantics/nonPublished General computer science Exact and natural sciences Network computers Kernel functions Graph theory -- Data processing Markov processes betweenness centrality large scale graphs semi-supervised classification graph kernels
