Generating artificial data for the evaluation of concept learning algorithms

Hunniford, Thomas J. C. January 1998
No description available.

Méthodes probabilistes pour l'analyse des algorithmes sur les tesselations aléatoires / Probabilistic methods for the analysis of algorithms on random tessellations

Hemsley, Ross 16 December 2014
In this thesis, we leverage the tools of probability theory and stochastic geometry to investigate the behavior of algorithms on geometric tessellations of space. This work is split between two main themes, the first of which is focused on the problem of navigating the Delaunay tessellation and its geometric dual, the Voronoi diagram. We explore the applications of this problem to point location using walking algorithms and the study of online routing in networks. 
We then propose and investigate two new algorithms which navigate the Delaunay triangulation, which we call Pivot Walk and Cone Walk. For Cone Walk, we provide a detailed average-case analysis, giving explicit bounds on the properties of the worst possible path taken by the algorithm on a random Delaunay triangulation in a bounded convex region. This analysis is a significant departure from similar results that have been obtained, due to the difficulty of dealing with the complex dependence structure of localized navigation algorithms on the Delaunay triangulation. The second part of this work is concerned with the study of extremal properties of random tessellations. In particular, we derive the first and last order-statistics for the inballs of the cells in a Poisson line tessellation. This result has implications for algorithms involving line tessellations, such as locality sensitive hashing. As a corollary, we show that the cells minimizing the area are triangles.

Search and broadcast in stochastic environments, a biological perspective / Recherche et diffusion d'informations dans un environnement bruité, une perspective biologique

Boczkowski, Lucas 30 November 2018
This thesis is built around two series of works, each motivated by experiments on ants. We derive and analyse new models that use computer science concepts and methodology, despite their biological roots and motivation. The first model studied in this thesis takes its inspiration from the collaborative transport of food in the species P. longicornis. We find that some key aspects of the process are well described by a graph search problem with noisy advice. The advice corresponds to characteristic short scent marks laid in front of the load in order to facilitate its navigation. In this thesis, we provide a detailed analysis of the model on trees, which are relevant graph structures from a computer science standpoint. In particular, our model may be viewed as a noisy extension of binary search to trees. Tight results in expectation and with high probability are derived, with matching upper and lower bounds. Interestingly, there is a sharp phase transition phenomenon for the expected runtime, but not when the algorithms are only required to succeed with high probability. The second model we work with was initially designed to capture information broadcast amongst desert ants. The model uses a stochastic meeting pattern and noise in the interactions, in a way that matches experimental data. Within this theoretical model, we present a strong lower bound on the number of interactions required before information can be spread reliably. Experimentally, we see that the time required for the recruitment process of even a few ants increases sharply with the group size, in accordance with our result. A theoretical consequence of the lower bound is a separation between the uniform noisy PUSH and PULL models of interaction. We also study a close variant of broadcast, without noise this time but under stricter convergence requirements, and show that in this case the problem can be solved efficiently, even with very limited exchange of information in each interaction.

Design and Analysis of Multidimensional Data Structures

Duch Brown, Amàlia 09 December 2004
This thesis is about the design and analysis of point multidimensional data structures: data structures that store $K$-dimensional keys which we may abstract as points in $[0,1]^K$. These data structures are present in many applications of geographical information systems, image processing or robotics, among others. They are also frequently used as indexes of more complex data structures, possibly stored in external memory. Point multidimensional data structures must have capabilities such as insertion, deletion and (exact) search of items, but in addition they must support the so-called associative queries. Examples of these queries are orthogonal range queries (which items fall inside a given hyper-rectangle?) and nearest neighbour queries (which is the closest item to some given point?). The contributions of this thesis are two-fold: contributions to the design of point multidimensional data structures (the design of randomized $K$-d trees, the design of randomized quad trees and the design of fingered multidimensional search trees), and contributions to the analysis of the performance of point multidimensional data structures (the average-case analysis of partial match queries in relaxed $K$-d trees and the average-case analysis of orthogonal range queries in various multidimensional data structures). Concerning the design of randomized point multidimensional data structures, we propose randomized insertion and deletion algorithms for $K$-d trees and quad trees that produce random $K$-d trees and quad trees independently of the order in which items are inserted into them and after any sequence of interleaved insertions and deletions. The use of randomization provides expected performance guarantees, irrespective of any assumption on the data distribution, while retaining the simplicity and flexibility of standard $K$-d trees and quad trees. Also related to the design of point multidimensional data structures is the proposal of fingered multidimensional search trees, a new technique that enhances point multidimensional data structures to exploit locality of reference in associative queries. With regard to performance analysis, we start by giving a precise analysis of the cost of partial matches in randomized $K$-d trees. We use these results as a building block in our analysis of orthogonal range queries, together with combinatorial and geometric arguments, and we provide a tight asymptotic estimate of the cost of orthogonal range search in randomized $K$-d trees. We finally show that the techniques used apply easily to other data structures, so we can provide an analysis of the average cost of orthogonal range search in other data structures such as standard $K$-d trees, quad trees, quad tries, and $K$-d tries.
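
As an illustration of the orthogonal range queries described in this abstract, the following is a minimal Python sketch of a relaxed $K$-d tree, in which each node draws its discriminant (the coordinate it compares on) uniformly at random. It uses plain insertion at the leaves, so it is not the randomized insertion and deletion algorithms proposed in the thesis; the names `Node`, `insert` and `range_search` are hypothetical and chosen only for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int                       # discriminant: coordinate compared at this node
    left: Optional["Node"] = None   # points with a smaller coordinate on `axis`
    right: Optional["Node"] = None  # points with a greater or equal coordinate


def insert(root: Optional[Node], point: Point, rng: random.Random) -> Node:
    """Insert `point` at a leaf; the new node draws its discriminant at random."""
    new_node = Node(point, rng.randrange(len(point)))
    if root is None:
        return new_node
    node = root
    while True:
        side = "left" if point[node.axis] < node.point[node.axis] else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, new_node)
            return root
        node = child


def range_search(node: Optional[Node], lo: Point, hi: Point, out: List[Point]) -> None:
    """Append to `out` every stored point p with lo[i] <= p[i] <= hi[i] for all i."""
    if node is None:
        return
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] < node.point[a]:       # the query box reaches into the left subtree
        range_search(node.left, lo, hi, out)
    if hi[a] >= node.point[a]:      # the query box reaches into the right subtree
        range_search(node.right, lo, hi, out)
```

A query only descends into a subtree when the hyper-rectangle can intersect it on the node's discriminant, which is the pruning whose average cost the abstract refers to.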

From Worst-Case to Average-Case Efficiency – Approximating Combinatorial Optimization Problems

Plociennik, Kai 18 February 2011
In theoretical computer science, various notions of efficiency are used for algorithms. The most commonly used notion is worst-case efficiency, which is defined by requiring polynomial worst-case running time. Another commonly used notion is average-case efficiency for random inputs, which is roughly defined as having polynomial expected running time with respect to the random inputs. Depending on the notion of efficiency one uses, the approximability of a combinatorial optimization problem can be very different. In this dissertation, the approximability of three classical combinatorial optimization problems, namely Independent Set, Coloring, and Shortest Common Superstring, is investigated for different notions of efficiency. For the three problems, approximation algorithms are given which guarantee approximation ratios that are unachievable by worst-case efficient algorithms under reasonable complexity-theoretic assumptions. The algorithms achieve polynomial expected running time for different models of random inputs. On the one hand, classical average-case analyses are performed, using totally random input models as the source of random inputs. On the other hand, probabilistic analyses are performed, using semi-random input models inspired by the so-called smoothed analysis of algorithms. Finally, the expected performance of well-known greedy algorithms for random inputs from the considered models is investigated. Also, the expected behavior of some properties of the random inputs themselves is considered.
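
To make the idea of running a greedy algorithm on random inputs concrete, here is a small Python sketch of a greedy heuristic for Independent Set on an Erdős–Rényi random graph. The abstract does not say which greedy algorithms or which input models are studied, so both the G(n, p) model and the functions `random_graph` and `greedy_independent_set` are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Set


def random_graph(n: int, p: float, rng: random.Random) -> Dict[int, Set[int]]:
    """Sample G(n, p): each of the n*(n-1)/2 possible edges appears with probability p."""
    adj: Dict[int, Set[int]] = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def greedy_independent_set(adj: Dict[int, Set[int]]) -> List[int]:
    """Scan vertices in increasing order; keep a vertex if no kept vertex is adjacent."""
    chosen: List[int] = []
    blocked: Set[int] = set()
    for v in sorted(adj):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= adj[v]   # neighbours of a chosen vertex can no longer be chosen
    return chosen
```

The result is always a maximal independent set, though not necessarily a maximum one; how close it comes on random inputs is the kind of question an expected-performance analysis addresses.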

From Worst-Case to Average-Case Efficiency – Approximating Combinatorial Optimization Problems

Plociennik, Kai 27 January 2011
In theoretical computer science, various notions of efficiency are used for algorithms. The most commonly used notion is worst-case efficiency, which is defined by requiring polynomial worst-case running time. Another commonly used notion is average-case efficiency for random inputs, which is roughly defined as having polynomial expected running time with respect to the random inputs. Depending on the notion of efficiency one uses, the approximability of a combinatorial optimization problem can be very different. In this dissertation, the approximability of three classical combinatorial optimization problems, namely Independent Set, Coloring, and Shortest Common Superstring, is investigated for different notions of efficiency. For the three problems, approximation algorithms are given which guarantee approximation ratios that are unachievable by worst-case efficient algorithms under reasonable complexity-theoretic assumptions. The algorithms achieve polynomial expected running time for different models of random inputs. On the one hand, classical average-case analyses are performed, using totally random input models as the source of random inputs. On the other hand, probabilistic analyses are performed, using semi-random input models inspired by the so-called smoothed analysis of algorithms. Finally, the expected performance of well-known greedy algorithms for random inputs from the considered models is investigated. Also, the expected behavior of some properties of the random inputs themselves is considered.

Average case analysis of algorithms for the maximum subarray problem

Bashar, Mohammad Ehsanul January 2007
The Maximum Subarray Problem (MSP) is to find the contiguous portion of an array that maximizes the sum of the elements in it. The goal is to locate the most useful and informative array segment that associates two parameters involved in data in a 2D array. It is an efficient data mining method which gives us an accurate pattern or trend of data with respect to some associated parameters. Distance Matrix Multiplication (DMM) is at the core of MSP, and DMM and MSP have worst-case complexity of the same order, so an improved algorithm for DMM also yields an improvement for MSP. The complexity of conventional DMM is O(n³). In the average case, an algorithm for the All Pairs Shortest Path (APSP) problem can be modified to serve as a fast engine for DMM, which can then be solved in O(n² log n) expected time. Using this result, MSP can be solved in O(n² log² n) expected time. MSP can be extended to K-MSP. To incorporate DMM into K-MSP, DMM needs to be extended to K-DMM as well. In this research we show how DMM can be extended to K-DMM using the K-Tuple Approach to solve K-MSP in O(Kn² log² n log K) time when K ≤ n/log n. We also present the Tournament Approach, which solves K-MSP in O(n² log² n + Kn²) time and outperforms the K-Tuple
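
For readers unfamiliar with the two problems named in this abstract, the sketch below gives the conventional cubic-time baselines in Python: the 2D maximum subarray computed with Kadane's algorithm over row ranges, and the straightforward distance matrix product. These are textbook baselines, not the DMM-based, K-Tuple or Tournament algorithms of the thesis; the function names are hypothetical, and the (min, +) convention for the distance product is an assumption.

```python
from typing import List, Sequence


def max_subarray_1d(xs: Sequence[float]) -> float:
    """Kadane's algorithm: largest sum of a non-empty contiguous run of xs."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)    # either extend the current run or restart at x
        best = max(best, cur)
    return best


def max_subarray_2d(a: Sequence[Sequence[float]]) -> float:
    """Largest sum of a non-empty rectangular sub-block of a, in O(rows^2 * cols)."""
    rows, cols = len(a), len(a[0])
    best = a[0][0]
    for top in range(rows):
        col_sums = [0] * cols    # column sums over rows top..bottom
        for bottom in range(top, rows):
            for j in range(cols):
                col_sums[j] += a[bottom][j]
            best = max(best, max_subarray_1d(col_sums))
    return best


def distance_product(a: Sequence[Sequence[float]],
                     b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Conventional O(n^3) distance product: c[i][j] = min over k of a[i][k] + b[k][j]."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For a square n × n input both routines take O(n³) time, which is the conventional cost the abstract compares against.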

