1. Shape Recipes: Scene Representations that Refer to the Image. Freeman, William T.; Torralba, Antonio. 01 September 2002.
The goal of low-level vision is to estimate an underlying scene, given an observed image. Real-world scene properties (e.g., albedos or shapes) can be very complex, conventionally requiring high-dimensional representations which are hard to estimate and store. We propose a low-dimensional representation, called a scene recipe, that relies on the image itself to describe the complex scene configurations. Shape recipes are an example: these are the regression coefficients that predict the bandpassed shape from bandpassed image data. We describe the benefits of this representation and show two uses illustrating its properties: (1) we improve stereo shape estimates by learning shape recipes at low resolution and applying them at full resolution; (2) shape recipes implicitly contain information about lighting and materials, which we use for material segmentation.
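The core idea, a per-subband linear regression from image to shape, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes one linear kernel per subband, fit by least squares over local image patches.

```python
import numpy as np

def learn_shape_recipe(image_band, shape_band, k=3):
    """Fit a 'shape recipe': regression coefficients mapping a k x k
    neighborhood of the bandpassed image to the bandpassed shape.
    (Hypothetical simplification: a single linear kernel per subband.)"""
    pad = k // 2
    H, W = image_band.shape
    X, y = [], []
    for i in range(pad, H - pad):
        for j in range(pad, W - pad):
            X.append(image_band[i-pad:i+pad+1, j-pad:j+pad+1].ravel())
            y.append(shape_band[i, j])
    coeffs, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return coeffs  # k*k recipe coefficients

def apply_shape_recipe(image_band, coeffs, k=3):
    """Predict the bandpassed shape from the bandpassed image."""
    pad = k // 2
    H, W = image_band.shape
    out = np.zeros_like(image_band)
    for i in range(pad, H - pad):
        for j in range(pad, W - pad):
            out[i, j] = image_band[i-pad:i+pad+1, j-pad:j+pad+1].ravel() @ coeffs
    return out
```

Because the recipe is just a handful of coefficients per subband, it is far cheaper to store than the full shape map it regenerates from the image.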
2. Recognizing Indoor Scenes. Torralba, Antonio; Sinha, Pawan. 25 July 2001.
We propose a scheme for indoor place identification based on the recognition of global scene views. Scene views are encoded using a holistic representation that provides low-resolution spatial and spectral information. The holistic nature of the representation dispenses with the need to rely on specific objects or local landmarks and also renders it robust against variations in object configurations. We demonstrate the scheme on the problem of recognizing scenes in video sequences captured while walking through an office environment. We develop a method for distinguishing between 'diagnostic' and 'generic' views and also evaluate changes in system performance as a function of the amount of training data available and the complexity of the representation.
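A holistic descriptor of this flavor can be sketched as below: the image is split into a coarse spatial grid, and each cell records a few bands of spectral energy, ignoring individual objects entirely. This is a hypothetical simplification in the spirit of "gist"-style features, not the paper's exact representation.

```python
import numpy as np

def holistic_descriptor(image, grid=4, n_bands=4):
    """Coarse spatial grid of radial spectral-energy bands (a minimal
    gist-like sketch; parameters grid and n_bands are illustrative)."""
    H, W = image.shape
    feats = []
    for gi in range(grid):
        for gj in range(grid):
            cell = image[gi*H//grid:(gi+1)*H//grid, gj*W//grid:(gj+1)*W//grid]
            spectrum = np.abs(np.fft.rfft2(cell))
            # radial frequency of each FFT bin
            fy, fx = np.meshgrid(np.fft.fftfreq(cell.shape[0]),
                                 np.fft.rfftfreq(cell.shape[1]), indexing='ij')
            r = np.hypot(fy, fx)
            for b in range(n_bands):
                mask = (r >= b / (2 * n_bands)) & (r < (b + 1) / (2 * n_bands))
                feats.append(spectrum[mask].mean() if mask.any() else 0.0)
    return np.asarray(feats)
```

The resulting vector is low dimensional (grid * grid * n_bands entries), which is what makes nearest-neighbor place identification over many views cheap.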
3. Properties and Applications of Shape Recipes. Torralba, Antonio; Freeman, William T. 01 December 2002.
In low-level vision, the representations of scene properties such as shape and albedo are very high dimensional, as they must describe complicated structures. The approach proposed here is to let the image itself bear as much of the representational burden as possible. In many situations, scene and image are closely related, and it is possible to find a functional relationship between them. The scene information can then be represented in reference to the image, where the functional specifies how to translate the image into the associated scene. We illustrate the use of this representation for encoding shape information. We show that this representation has appealing properties such as locality and slow variation across space and scale. These properties provide a way of improving shape estimates coming from other sources of information, such as stereo.
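The "slow variation across scale" property is what lets a recipe learned at one resolution transfer to another. As an extreme, hypothetical simplification, suppose the recipe for a subband is a single gain relating image band to shape band; fitting it on a downsampled pair and reusing it at full resolution sketches the idea:

```python
import numpy as np

def scalar_recipe(image_band, shape_band):
    """One-number 'recipe' per subband: the least-squares gain relating
    the image band to the shape band (hypothetical toy version)."""
    return float((image_band * shape_band).sum() / (image_band ** 2).sum())

def transfer_recipe(image_full, image_low, shape_low):
    """Learn the gain on a low-resolution pair, then apply it at full
    resolution, relying on the recipe varying slowly across scale."""
    g = scalar_recipe(image_low, shape_low)
    return g * image_full
```

In the actual shape-recipe framework the regression is per subband of a pyramid decomposition; the scalar gain above just makes the cross-scale reuse concrete.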
4. Prioritized 3D Scene Reconstruction and Rate-Distortion Efficient Representation for Video Sequences. Imre, Evren. 01 August 2007.
In this dissertation, a novel scheme performing 3D reconstruction of a scene from a 2D video sequence is presented. To this aim, first, the trajectories of the salient features in the scene are determined as a sequence of displacements via the Kanade-Lucas-Tomasi tracker and a Kalman filter. Then, a tentative camera trajectory with respect to a metric reference reconstruction is estimated. All frame pairs are ordered with respect to their amenability to 3D reconstruction by a metric that utilizes the baseline distances and the number of tracked correspondences between the frames. The ordered frame pairs are processed via a sequential structure-from-motion algorithm to estimate the sparse structure and camera matrices. The metric and the associated reconstruction algorithm are shown via experiments to outperform their counterparts in the literature. Finally, a mesh-based, rate-distortion efficient representation is constructed through a novel procedure driven by the error between a target image and its prediction from a reference image and the current mesh. At each iteration, the triangular patch whose projection on the predicted image has the largest error is identified. Within this projected region and its correspondence on the reference frame, feature matches are extracted. The pair with the least conformance to the planar model is used to determine the vertex to be added to the mesh. The procedure is shown to outperform the dense depth-map representation in all tested cases, and the block motion vector representation in scenes with large depth range, in the rate-distortion sense.
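The frame-pair prioritization can be sketched as a simple scoring function. The thesis defines its own metric over baseline distance and tracked correspondences; the product used below is only an illustrative stand-in for how such an ordering might work.

```python
def rank_frame_pairs(pairs):
    """Order frame pairs by amenability to 3D reconstruction.
    Each pair is a dict with 'baseline' (metric baseline length) and
    'n_matches' (tracked correspondences). The score baseline * n_matches
    is a hypothetical stand-in for the thesis's metric: wide baselines
    give well-conditioned triangulation, many matches give redundancy."""
    return sorted(pairs,
                  key=lambda p: p["baseline"] * p["n_matches"],
                  reverse=True)
```

The top-ranked pairs seed the sequential structure-from-motion stage, so a good ordering directly improves the sparse reconstruction.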
5. Leveraging foundation models towards semantic world representations for robotics. Kuwajerwala, Alihusein. 06 1900.
A central challenge in robotics is building actionable world representations. To perform complex tasks, robots need to build a 3D representation of their environment that captures the geometric, visual, and semantic information of the scene, and is efficient to use. Existing approaches encode semantic information using a (finite) set of semantic class labels, such as “person” and “chair”. However, for ambiguous instructions to a robot, such as “get me a healthy snack”, this approach is insufficient. As a result, recent works have leveraged large pre-trained neural networks called “foundation models”, whose learned latent representations offer more flexibility than class labels; these approaches, however, can be inefficient. For example, they may require prohibitive amounts of video memory, or may not allow the map to be edited.
In this work, we construct 3D scene representations that leverage foundation models to encode semantics, allowing for open-vocabulary and multimodal queries while still being scalable and efficient. We first present ConceptFusion, which builds open-vocabulary 3D maps by assigning each 3D point a feature vector that encodes semantics, enabling nuanced and multimodal queries, but at a high memory cost. We then present ConceptGraphs, which builds upon the previous approach with a scene graph structure that assigns semantic feature vectors to objects instead of points, increasing efficiency while also enabling planning over the constructed scene graph. Neither system requires any additional training or fine-tuning of models, yet both enable robots to perform novel search and navigation tasks, as shown by our real-world experiments.
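An open-vocabulary query over such a map reduces to similarity search in the shared embedding space. The sketch below assumes each map object already carries a semantic feature vector (per-object in the ConceptGraphs style) and that the query has been embedded by some text or image encoder; the cosine-similarity retrieval is generic, not the systems' exact pipeline.

```python
import numpy as np

def query_scene(object_features, object_names, query_feature, top_k=1):
    """Return the top_k map objects whose semantic feature vectors best
    match the query embedding, by cosine similarity. Feature vectors and
    the query embedding are placeholders for encoder outputs."""
    F = np.asarray(object_features, dtype=float)   # (n_objects, d)
    q = np.asarray(query_feature, dtype=float)     # (d,)
    sims = (F @ q) / (np.linalg.norm(F, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return [(object_names[i], float(sims[i])) for i in order]
```

Storing one vector per object rather than per 3D point is exactly what shrinks the memory footprint from the ConceptFusion-style map to the graph-structured one.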
6. The Stixel World. Pfeiffer, David. 31 August 2012.
The Stixel World is a novel and versatile medium-level representation that efficiently bridges the gap between pixel-based processing and high-level vision. Modern stereo matching schemes can obtain a depth measurement for almost every pixel of an image in real time, thus allowing the application of new and powerful algorithms. However, this also results in a large amount of measurement data that has to be processed and evaluated. With respect to vision-based driver assistance, these algorithms are executed on highly integrated low-power processing units that leave no room for computationally intensive methods. At the same time, the growing number of independently executed vision tasks calls for new concepts to manage the resulting system complexity. These challenges are tackled by introducing a pre-processing step that extracts all required information in advance. Each Stixel approximates a part of an object along with its distance and height, segmenting the environment into free space and objects. The Stixel World is computed in a single unified optimization scheme: relying on dynamic programming guarantees the globally optimal segmentation for the entire scenario, and strong use is made of physically motivated a priori knowledge about our man-made three-dimensional environment. Kalman filtering techniques are used to precisely estimate the motion state of all tracked objects. Particular emphasis is put on a thorough performance evaluation. Different comparative strategies are followed, including LIDAR, RADAR, and IMU reference sensors, manually created ground-truth data, and real-world tests. Altogether, the Stixel World is ideally suited to serve as the basic building block for today's increasingly complex vision systems. It is an extremely compact abstraction of the actual world that gives access to the most essential information about the current scenario. As a result of this thesis, the efficiency of subsequently executed vision algorithms and applications has improved significantly.
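The compactness of the representation is easy to see in code: a Stixel is a thin vertical stick with a column, a vertical extent, and a distance, and per-column free space falls out directly. The field names and free-space computation below are illustrative, not the thesis's exact data layout.

```python
from dataclasses import dataclass

@dataclass
class Stixel:
    """One Stixel: a thin vertical stick approximating part of an object
    (field names are illustrative, not the thesis's exact layout)."""
    column: int      # image column the Stixel occupies
    v_top: int       # top image row of the object part
    v_bottom: int    # bottom image row (ground contact)
    distance: float  # metric distance to the object part, in meters

def free_space(stixels, n_columns=640, max_range=50.0):
    """Per-column free space in front of the sensor: the distance to the
    nearest Stixel in that column, or max_range if none was measured."""
    fs = [max_range] * n_columns
    for s in stixels:
        fs[s.column] = min(fs[s.column], s.distance)
    return fs
```

A few hundred such records replace hundreds of thousands of raw depth measurements, which is why downstream vision tasks built on the Stixel World run so much faster.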