Global ETD Search

81	Towards open-world image recognition Saito, Kuniaki 17 September 2024 (has links) Deep neural networks can achieve state-of-the-art performance on various image recognition tasks, such as object categorization (image classification) and object localization (object detection), with the help of a large amount of training data. However, to achieve models that perform well in the real world, we must overcome the shift from training to real-world data, which involves two factors: (1) covariate shift and (2) unseen classes. Covariate shift occurs when the input distribution of a particular category changes from the training time. Deep models can easily make mistakes with a small change in the input, such as small noise addition, lighting change, or changes in the object pose. On the other hand, unseen classes - classes that are absent in the training set - may be present in real-world test samples. It is important to differentiate between "seen" and "unseen" classes in image classification, while locating diverse classes, including classes unseen during training, is crucial in object detection. Therefore, an open-world image recognition model needs to handle both factors. In this thesis, we propose approaches for image classification and object detection that can handle these two kinds of shifts in a label-efficient way. Firstly, we examine the adaptation of large-scale pre-trained models to the object detection task while preserving their robustness to handle covariate shift. We investigate various pre-trained models and discover that the acquisition of robust representations by a trained model depends heavily on the pre-trained model’s architecture. Based on this intuition, we develop simple techniques to prevent the loss of generalizable representations. Secondly, we study the adaptation to an unlabeled target domain for object detection to address the covariate shift. 
Traditional domain alignment methods may be inadequate due to various factors that cause domain shift between the source and target domains, such as layout and the number of objects in an image. To address this, we propose a strong-weak distribution alignment approach that can handle diverse domain shifts. Furthermore, we study the problem of semi-supervised domain adaptation for image classification when partially labeled target data is available. We introduce a simple yet effective approach, MME, for this task, which extracts discriminative features for the target domain using adversarial learning. We also develop a method to handle the situation where the unlabeled target domain includes categories unseen in the source domain. Since there is no supervision, recognizing instances of unseen classes as "unseen" is challenging. To address this, we devise a straightforward approach that trains a one-vs-all classifier using source data to build a classifier that can detect unseen instances. Additionally, we introduce an approach to enable an object detector to recognize an unseen foreground instance as an "object" using a simple data augmentation and learning framework that is applicable to diverse detectors and datasets. In conclusion, thanks to their simple design, our proposed approaches can be applied to various datasets and architectures and achieve state-of-the-art results. Our work can contribute to the development of a unified open-world image recognition model in future research. Computer science Image classification Object detection Pattern recognition Transfer learning
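The one-vs-all idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the sigmoid scoring, and the 0.5 rejection threshold are all assumptions made for illustration. Each known class gets an independent binary score, and an input is labeled "unseen" when no class accepts it.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_open_set(logits, class_names, threshold=0.5):
    """Hypothetical open-set decision rule (an assumption, not the thesis's method).

    logits: one raw score per known class, each produced by an
            independent binary (one-vs-all) head.
    Returns the best known class, or "unseen" if every head rejects the input.
    """
    probs = [sigmoid(z) for z in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # No one-vs-all head claims the sample, so treat it as an unseen class.
        return "unseen"
    return class_names[best]
```

Unlike a softmax classifier, whose probabilities must sum to one, independent binary heads can all report low confidence at once, which is what makes the rejection step possible.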
82	A Machine Learning Approach to Recognize Environmental Features Associated with Social Factors Diaz-Ramos, Jonathan 11 June 2024 (has links) In this thesis we aim to supplement the Climate and Economic Justice Screening Tool (CEJST), which assists federal agencies in identifying disadvantaged census tracts, by extracting five environmental features from Google Street View (GSV) images. The five environmental features are garbage bags, greenery, and three distinct road damage types (longitudinal, transverse, and alligator cracks), which were identified using image classification, object detection, and image segmentation. We evaluate three cities using this feature space in order to distinguish between disadvantaged and non-disadvantaged census tracts. The results of the analysis reveal the significance of the feature space and demonstrate the time efficiency, detail, and cost-effectiveness of the proposed methodology. / Master of Science / computer vision object detection image segmentation deep learning
83	Cooperative Perception for Connected Autonomous Vehicle Edge Computing System Chen, Qi 08 1900 (has links) This dissertation first studies raw-data-level cooperative perception for enhancing the detection ability of self-driving systems in connected autonomous vehicles (CAVs). A LiDAR (Light Detection and Ranging) point-cloud-based 3D object detection method is deployed to enhance detection performance by expanding the effective sensing area, capturing critical information in multiple scenarios, and improving detection accuracy. In addition, a cooperative perception framework based on point cloud features is proposed for a CAV edge computing system. This dissertation also exploits the intrinsically small size of the features to achieve real-time edge computing without the risk of congesting the network. To distinguish small objects such as pedestrians and cyclists in 3D data, an end-to-end multi-sensor fusion model is developed to perform 3D object detection from multi-sensor data. Experiments show that by solving multiple perception tasks on camera and LiDAR data jointly, the detection model can leverage the advantages of high-resolution images and physical-world LiDAR mapping data, and it leads the KITTI benchmark on 3D object detection. Finally, an application of cooperative perception is deployed on the edge to heal the live map for autonomous vehicles. Through 3D reconstruction and multi-sensor fusion detection, experiments on a real-world dataset demonstrate that a high-definition (HD) map on the edge can provide well-sensed local data to CAVs for navigation. Object Detection Multi-sensor Fusion Connected Autonomous Vehicles Edge Computing
84	Integrating Multiple Deep Learning Models for Disaster Description in Low-Altitude Videos Wang, Haili 12 1900 (has links) Computer vision technologies are rapidly improving and becoming more important in disaster response. The majority of disaster description techniques now focus on either identifying objects or categorizing disasters. In this study, we trained multiple deep neural networks on low-altitude imagery with highly imbalanced and noisy labels. We utilize labeled images from the LADI dataset to formulate a solution for the general problem of disaster classification and object detection. Our research integrated and developed multiple deep learning models that perform both the object detection task and the disaster scene classification task. Our solution is competitive in the TRECVID Disaster Scene Description and Indexing (DSDI) task, demonstrating that it is comparable to other suggested approaches in retrieving disaster-related video clips. Computer Vision Disaster Management Object Detection LADI Dataset
85	Object Detection for Aerial View Images: Dataset and Learning Rate Qi, Yunlong 05 1900 (has links) In recent years, deep-learning-based computer vision technology has developed rapidly. This is due not only to improvements in computing power, but also to the emergence of high-quality datasets. The combination of object detectors and drones has great potential in the field of rescue and disaster relief. We created an image dataset specifically for vision applications on drone platforms. The dataset contains 5000 images, and each image is carefully labeled according to the PASCAL VOC standard. This dataset will be very important for developing deep learning algorithms for drone applications. In object detection models, the loss function plays a vital role. Considering the uneven distribution of large and small objects in the dataset, we propose adjustment coefficients based on the frequencies of objects of different sizes to adjust the loss function, which ultimately improves the accuracy of the model. UNT Aerial Dataset Object Detection Learning Rate Engineering, Electronics and Electrical
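The abstract above does not give the formula for its adjustment coefficients, so the following is only a sketch of one common option, inverse-frequency weighting per size bucket. The bucket names, the area thresholds (32x32 and 96x96 pixels, borrowed from the COCO convention), and both function names are assumptions, not details from the thesis.

```python
from collections import Counter


def size_bucket(width, height, small=32 * 32, large=96 * 96):
    """Assign a box to a size bucket by pixel area (thresholds are assumed)."""
    area = width * height
    if area < small:
        return "small"
    if area < large:
        return "medium"
    return "large"


def adjustment_coefficients(boxes):
    """Hypothetical inverse-frequency loss weights per size bucket.

    boxes: iterable of (width, height) pairs for the labeled objects.
    A bucket that holds fewer objects receives a larger coefficient, so
    rare sizes contribute more to the loss per object.
    """
    counts = Counter(size_bucket(w, h) for w, h in boxes)
    total = sum(counts.values())
    num_buckets = len(counts)
    return {bucket: total / (num_buckets * n) for bucket, n in counts.items()}
```

With this normalization, the coefficients average to one when weighted by object count, so the overall loss scale stays comparable to the unweighted loss.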
86	Enhancing Layout Understanding via Human-in-the-Loop: A User Study on PDF-to-HTML Conversion for Long Documents Mao, Chenyu 24 March 2025 (has links) Document layout understanding often utilizes object detection to locate and parse document elements, enabling systems that convert documents into searchable and editable formats to enhance accessibility and usability. Nevertheless, the recognition results often contain errors that require manual correction due to small training dataset size, limitations of models, and defects in training annotations. However, many of these problems can be addressed via human review to improve correctness. We first improved our system by combining the previous Electronic Thesis/Dissertation (ETD) parsing tool and AI-aided annotation tool, providing instant and accurate file output. Then we used our new pipeline to investigate the effectiveness and efficiency of manual correction strategies in improving object detection accuracy through user studies, including 8 participants, comprising a balanced number of four STEM and four non-STEM researchers, all with some background in ETDs. Each participant was assigned correction tasks on a set of ETDs from both STEM and non-STEM disciplines to ensure comprehensive evaluation across different document types. We collected quantitative metrics, such as completion times, accuracy rates, number of wrong labels, and feedback through our post-survey, to assess the usability and performance of the manual correction process and to examine their relationship with users' academic backgrounds. Results demonstrate that manual adjustment significantly enhanced the accuracy of document element identification and classification, with experienced participants achieving superior correction precision. Furthermore, usability feedback revealed a strong correlation between user satisfaction and system design, providing valuable insights for future system enhancement and development. 
/ Master of Science / With the development of technology, there is an increasing demand to make printed and scanned documents more accessible. Organizations such as universities and libraries have millions of valuable documents, including theses, dissertations, and research papers, which exist only in PDF, often as a scanned format. While these works contain valuable knowledge, they can be challenging to search through or access, especially for those with low vision. To solve this problem, we need computer systems that automatically recognize and convert different parts of these documents --- like titles, headings, paragraphs, and figures --- into more usable forms. Our research focuses on improving how these document recognition systems work by combining computer automation with human expertise. While computers can process documents quickly, they sometimes need more training data for complex document layouts. We developed a web-based tool allowing people to review the computer's work and correct errors, such as mislabeled sections or missed elements. We conducted a detailed study with 8 participants who used our correction tool, to understand how effective this human-computer collaboration could be. We carefully measured several aspects of their experience: how many pages they annotated in a fixed amount of time, how accurate their corrections were, and how they felt about using the tool. We also used a post-survey to gather feedback about their experience with the tool. The results were very encouraging. When humans reviewed and corrected the computer's work, the accuracy of document recognition improved significantly. We found that participants could effectively identify and fix errors in the computer's output, especially when the tool was easy to use. Higher user satisfaction was strongly linked to how intuitive and straightforward participants found the correction process. One useful finding was that this process creates a positive feedback loop. 
Every correction a person makes helps expand the training data available to the computer system, which means the system can learn from these corrections and gradually become better at recognizing similar elements in future documents, reducing the number of errors that need to be corrected over time. Our research offers insights into building advanced object detection systems incorporating computational efficiency with human review. The results boost the formulation of optimal strategies for developing user-centric interfaces and effective document repair operations. This work has practical implications for making academic and research documents more accessible to everyone, including those relying on screen readers or other assistive technologies. This research represents a step forward in making the vast knowledge of digital documents more accessible, searchable, and usable for all readers. By showing how humans and computers can work together effectively, we are helping to build better systems for preserving and sharing knowledge in the digital age. ETD deep learning object detection document layout analysis
87	Sémantický popis obrazovky embedded zařízení / Semantic description of the embedded device screen Horák, Martin January 2020 (has links) This master's thesis deals with detecting user interface elements in an image of a printer display using convolutional neural networks. The theoretical part surveys currently used object detection architectures. The practical part covers the creation of the image gallery and the training and evaluation of selected models using the TensorFlow Object Detection API. The conclusion discusses the suitability of the trained models for the given task.
88	Machine vision for automation of earth-moving machines : Transfer learning experiments with YOLOv3 Borngrund, Carl January 2019 (has links) This master's thesis investigates the possibility of creating a machine vision solution for the automation of earth-moving machines. The research was motivated by the fact that, without some type of vision system, it will not be possible to create a fully autonomous earth-moving machine that can safely be used around humans or other machines. Cameras were used as the primary sensors because they are cheap, provide high resolution, and are the type of sensor that most closely mimics the human vision system. The purpose of this thesis was to use existing real-time object detectors together with transfer learning and to examine whether they can successfully be used to extract information in environments such as construction, forestry and mining. The amount of data needed to successfully train a real-time object detector was also investigated. Furthermore, the thesis examines whether there are particularly difficult situations for the chosen object detector, how reliable the object detector is, and finally how service-oriented architecture principles can be used to create deep learning systems. To investigate the questions formulated above, three data sets were created in which different properties were varied. These properties were light conditions, ground material and dump truck orientation. The data sets were created using a toy dump truck together with a similarly sized wheel loader with a camera mounted on the roof of its cab. The first data set contained only indoor images where the dump truck was placed in different orientations but neither the light nor the ground material changed. The second data set contained images where the light source was kept constant, but the dump truck orientation and ground materials changed. The last data set contained images where all properties were varied. 
The real-time object detector YOLOv3 was used to examine how a real-time object detector would perform depending on which of the three data sets it was trained on. Regardless of the data set, it was possible to train a model to perform real-time object detection. Using an Nvidia 980 Ti, the inference time of the model was around 22 ms, which is more than fast enough to process video running at 30 fps. Models trained on all three data sets converged to a training loss of around 0.10. The data set containing the most varied data, in which all properties were changed, performed considerably better, reaching a validation loss of 0.164, whereas the indoor data set, containing the least varied data, only reached a validation loss of 0.257. The size of the data set was also a factor in performance; however, it was not as important as having varied data. The results also showed that all three data sets could reach a mAP score of around 0.98 using transfer learning. Machine learning Machine vision YOLOv3 You only look once Computer vision Real time object detection Object detection Computer and Information Sciences Data- och informationsvetenskap
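The real-time claim in the abstract above rests on simple arithmetic: an inference time of about 22 ms per frame corresponds to roughly 45 frames per second, which exceeds the 30 fps of the video. A small sketch of that check follows; the function names are illustrative, and the calculation assumes inference is the only per-frame cost, ignoring decoding and post-processing.

```python
def max_fps(inference_ms: float) -> float:
    """Upper bound on frames per second if each frame costs inference_ms."""
    return 1000.0 / inference_ms


def keeps_up(inference_ms: float, video_fps: float) -> bool:
    """True if the detector can process frames at least as fast as they arrive."""
    return max_fps(inference_ms) >= video_fps


# The 22 ms and 30 fps figures come from the abstract.
print(f"{max_fps(22.0):.1f} fps available, real time at 30 fps: {keeps_up(22.0, 30.0)}")
```

Put differently, a 30 fps stream leaves a budget of about 33 ms per frame, so a 22 ms detector has roughly 11 ms of headroom for everything else in the pipeline.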
89	Experiential Sampling For Object Detection In Video Paresh, A 05 1900 (has links) The problem of object detection deals with determining whether an instance of a given class of object is present or not. There are robust, supervised learning based algorithms available for object detection in an image. These image object detectors (image-based object detectors) use characteristics learnt from the training samples to find object and non-object regions. The characteristics used are such that the detectors work under a variety of conditions and hence are very robust. Object detection in video can be performed by using such a detector on each frame of the video sequence. This approach checks for presence of an object around each pixel, at different scales. Such a frame-based approach completely ignores the temporal continuity inherent in the video. The detector declares presence of the object independent of what has happened in the past frames. Also, various visual cues such as motion and color, which give hints about the location of the object, are not used. The current work is aimed at building a generic framework for using a supervised learning based image object detector for video that exploits temporal continuity and the presence of various visual cues. We use temporal continuity and visual cues to speed up the detection and improve detection accuracy by considering past detection results. We propose a generic framework, based on Experiential Sampling [1], which considers temporal continuity and visual cues to focus on a relevant subset of each frame. We determine some key positions in each frame, called attention samples, and object detection is performed only at scales with these positions as centers. These key positions are statistical samples from a density function that is estimated based on various visual cues, past experience and temporal continuity. 
This density estimation is modeled as a Bayesian Filtering problem and is carried out using Sequential Monte Carlo methods (also known as Particle Filtering), where a density is represented by a weighted sample set. The experiential sampling framework is inspired by Neisser’s perceptual cycle [2] and Itti-Koch’s static visual attention model [3]. In this work, we first use Basic Experiential Sampling as presented in [1] for object detection in video and show its limitations. To overcome these limitations, we extend the framework to effectively combine top-down and bottom-up visual attention phenomena. We use the learning-based detector’s response, which is a top-down cue, along with visual cues to improve the attention estimate. To effectively handle multiple objects, we maintain a minimum number of attention samples per object. We propose to use motion as an alert cue to reduce the delay in detecting new objects entering the field of view. We use an inhibition map to avoid revisiting already attended regions. Finally, we improve detection accuracy by using a particle-filter-based detection scheme [4], also known as Track Before Detect (TBD). In this scheme, we compute the likelihood of the presence of the object based on current and past frame data. This likelihood is shown to be approximately equal to the product of average sample weights over past frames. Our framework results in a significant reduction in the overall computation required by the object detector, with an improvement in accuracy while retaining its robustness. This enables the use of learning-based image object detectors in real-time video applications, which are otherwise computationally expensive. We demonstrate the usefulness of this framework for frontal face detection in video. We use Viola-Jones’ frontal face detector [5] and color and motion visual cues. 
We show results for various cases such as sequences with a single object, multiple objects, distracting background, moving camera, changing illumination, objects entering/exiting the frame, crossing objects, objects with pose variation and sequences with scene change. The main contributions of the thesis are: i) We give an experiential sampling formulation for object detection in video. Many concepts like attention point and attention density, which are vague in [1], are precisely defined. ii) We combine the detector’s response with visual cues to estimate attention. This is inspired by a combination of top-down and bottom-up attention maps in visual attention models. To the best of our knowledge, this is used for the first time for object detection in video. iii) In the case of multiple objects, we highlight the problem with sample-based density representation and solve it by maintaining a minimum number of attention samples per object. iv) For objects first detected by the learning-based detector, we propose to use a TBD scheme for their subsequent detections along with the learning-based detector. This improves accuracy compared to using the learning-based detector alone. This thesis is organized as follows. Chapter 1: In this chapter we present a brief survey of related work and define our problem. Chapter 2: We present an overview of biological models that have motivated our work. Chapter 3: We give the experiential sampling formulation as in previous work [1], show results and discuss its limitations. Chapter 4: In this chapter, which is on Enhanced Experiential Sampling, we suggest enhancements to overcome the limitations of basic experiential sampling. We propose a track-before-detect scheme to improve detection accuracy. Chapter 5: We conclude the thesis and give possible directions for future work in this area. Appendix A: A description of the video database used in this thesis. Appendix B: A list of commonly used abbreviations and notations. 
Video Image Processing Sampling Techniques Experiential Sampling Image Object Detectors Video - Object Detection Object Detection Image Object Detector Bayesian Filtering Track Before Detect (TBD) Particle Filtering Applied Optics
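Two ideas from the abstract above lend themselves to a short sketch: representing a density by a weighted sample set, and approximating the likelihood of object presence as the product of average sample weights over past frames. The code below is a minimal illustration under stated assumptions, not the thesis's implementation. The function names are hypothetical, samples are treated as opaque values, and the weights are taken to be unnormalized per-frame sample weights.

```python
import random


def resample(samples, weights, rng):
    """Draw a new sample set of the same size, favoring high-weight samples.

    This is the standard resampling step of a particle filter: each new
    sample is drawn with probability proportional to its weight.
    """
    return rng.choices(samples, weights=weights, k=len(samples))


def presence_likelihood(weights_per_frame):
    """Product over frames of the average sample weight in each frame.

    weights_per_frame: one list of unnormalized sample weights per frame,
    covering the current and past frames (an assumed input layout).
    """
    likelihood = 1.0
    for weights in weights_per_frame:
        likelihood *= sum(weights) / len(weights)
    return likelihood
```

A detection decision could then compare this likelihood against a threshold, so that evidence accumulated over several frames counts for more than a single noisy detector response. The threshold itself would have to be chosen empirically.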
90	Automotive 3D Object Detection Without Target Domain Annotations Gustafsson, Fredrik, Linder-Norén, Erik January 2018 (has links) In this thesis we study a perception problem in the context of autonomous driving. Specifically, we study the computer vision problem of 3D object detection, in which objects should be detected from various sensor data and their position in the 3D world should be estimated. We also study the application of Generative Adversarial Networks in domain adaptation techniques, aiming to improve the 3D object detection model's ability to transfer between different domains. The state-of-the-art Frustum-PointNet architecture for LiDAR-based 3D object detection was implemented and found to closely match its reported performance when trained and evaluated on the KITTI dataset. The architecture was also found to transfer reasonably well from the synthetic SYN dataset to KITTI, and is thus believed to be usable in a semi-automatic 3D bounding box annotation process. The Frustum-PointNet architecture was also extended to explicitly utilize image features, which surprisingly degraded its detection performance. Furthermore, an image-only 3D object detection model was designed and implemented, which was found to compare quite favourably with current state-of-the-art in terms of detection performance. Additionally, the PixelDA approach was adopted and successfully applied to the MNIST to MNIST-M domain adaptation problem, which validated the idea that unsupervised domain adaptation using Generative Adversarial Networks can improve the performance of a task network for a dataset lacking ground truth annotations. Surprisingly, the approach did however not significantly improve upon the performance of the image-based 3D object detection models when trained on the SYN dataset and evaluated on KITTI. 
Object Detection 3D Object Detection Domain Adaptation Generative Adversarial Networks Computer Vision Deep Learning Machine Learning Autonomous Driving Signal Processing Signalbehandling

Search results