1. Advancing Chart Question Answering with Robust Chart Component Recognition. Zheng, Hanwen, 13 August 2024.
The task of comprehending charts [1, 2, 3] presents significant challenges for machine learning models due to the diverse and intricate shapes of charts. The chart extraction task ensures the precise identification of key components, while the chart question answering (ChartQA) task integrates visual and textual information to answer queries grounded in the chart's content. This research approaches ChartQA from two directions. First, we introduce ChartFormer, an integrated framework that simultaneously identifies and classifies every chart element. Beyond the data marks themselves, ChartFormer recognizes descriptive components such as the chart title, legend, and axes, providing a comprehensive understanding of the chart's content. Its end-to-end transformer architecture makes it particularly effective for complex instance segmentation tasks involving many object classes with distinct visual structures. Second, we present Question-guided Deformable Co-Attention (QDCAt), which fuses the two modalities by injecting question information into a deformable offset network and enhancing the visual representation from ChartFormer through a deformable co-attention block. / Master of Science / Real-world data often encompasses multimodal information, blending textual descriptions with visual representations. Charts in particular pose a significant challenge for machine learning models due to their condensed and complex structure, and existing multimodal methods often fail to integrate them effectively. To address this gap, we introduce ChartFormer, a unified framework designed to enhance chart understanding through instance segmentation, and a novel Question-guided Deformable Co-Attention (QDCAt) mechanism that integrates visual and textual features for chart question answering (ChartQA), allowing for more comprehensive reasoning. ChartFormer excels at identifying and classifying chart components such as bars, lines, pies, titles, legends, and axes. QDCAt further enhances multimodal fusion by aligning textual information with visual cues: by dynamically adjusting attention based on the question context, it ensures that the model focuses on the most relevant parts of the chart. Extensive experiments demonstrate that ChartFormer and QDChart significantly outperform their baseline models in chart component recognition and ChartQA by 3.2% in mAP and 15.4% in accuracy, respectively. These results make our approach a robust solution for detailed visual data interpretation across a wide range of domains, from scientific research to financial analysis and beyond.
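As a rough illustration of the QDCAt idea, the sketch below conditions deformable sampling offsets on a pooled question embedding and then lets the sampled visual tokens attend to the question tokens. All module names, shapes, the number of sampling points, and the fusion order are illustrative assumptions; the thesis does not publish this exact design.

```python
# A minimal sketch of question-guided deformable co-attention (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedDeformableCoAttention(nn.Module):
    def __init__(self, dim=256, num_points=4, num_heads=8):
        super().__init__()
        # Offsets are predicted from visual features conditioned on the question,
        # so "where to look" depends on what is being asked (hypothetical choice).
        self.offset_net = nn.Linear(2 * dim, 2 * num_points)
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_points = num_points

    def forward(self, feat, question):
        # feat: (B, C, H, W) visual features; question: (B, T, C) token embeddings
        B, C, H, W = feat.shape
        q_global = question.mean(dim=1)                        # (B, C) pooled question
        q_map = q_global[:, :, None, None].expand(B, C, H, W)
        offsets = self.offset_net(
            torch.cat([feat, q_map], dim=1).permute(0, 2, 3, 1))  # (B, H, W, 2P)
        # Build a base sampling grid in [-1, 1] and perturb it per sampling point.
        ys = torch.linspace(-1, 1, H, device=feat.device)
        xs = torch.linspace(-1, 1, W, device=feat.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1).expand(B, H, W, 2)
        sampled = []
        for p in range(self.num_points):
            grid = base + offsets[..., 2 * p:2 * p + 2].tanh() * 0.1  # small shifts
            sampled.append(F.grid_sample(feat, grid, align_corners=False))
        deform_feat = torch.stack(sampled).mean(0)             # (B, C, H, W)
        tokens = deform_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
        # Co-attention: deformably sampled visual tokens attend to the question.
        fused, _ = self.co_attn(tokens, question, question)
        return fused
```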
2. Real-Time Instance and Semantic Segmentation Using Deep Learning. Kolhatkar, Dhanvin, 10 June 2020.
In this thesis, we explore the use of Convolutional Neural Networks for semantic and instance segmentation, with a focus on applying existing methods with cheaper neural networks. We modify a fast object detection architecture for the instance segmentation task, and study the concepts behind these modifications both in the simpler context of semantic segmentation and in the more difficult context of instance segmentation. Various instance segmentation branch architectures are implemented in parallel with a box prediction branch, whose results are used to crop each instance's features. We compensate for the imprecision of the final box predictions and eliminate the need for bounding box alignment by using an enlarged bounding box for cropping. We report and study the performance, advantages, and disadvantages of each architecture. All of our methods achieve fast inference speeds.
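As an illustration of the enlarged-box cropping idea, the sketch below grows each predicted box before extracting instance features with RoIAlign, so a slightly imprecise box still covers the whole object. The enlargement factor, feature scale, and output size are assumed values, not taken from the thesis.

```python
# A minimal sketch of cropping instance features with enlarged bounding boxes.
import torch
from torchvision.ops import roi_align

def crop_with_enlarged_boxes(features, boxes, scale=0.125, enlarge=1.2, out_size=14):
    """features: (1, C, H, W) backbone feature map;
    boxes: (K, 4) predicted boxes as (x1, y1, x2, y2) in image coordinates."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]) * enlarge   # grow the box so that an
    h = (boxes[:, 3] - boxes[:, 1]) * enlarge   # imprecise prediction still
    enlarged = torch.stack(                     # covers the whole instance
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    batch_idx = torch.zeros(len(boxes), 1)      # all boxes come from image 0
    rois = torch.cat([batch_idx, enlarged], dim=1)  # (K, 5) as RoIAlign expects
    return roi_align(features, rois, output_size=out_size, spatial_scale=scale)
```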
3. Contextual Recurrent Level Set Networks and Recurrent Residual Networks for Semantic Labeling. Le, Ngan Thi Hoang, 01 May 2018.
Semantic labeling is becoming more and more popular among researchers in computer vision and machine learning. Many applications, such as autonomous driving, tracking, indoor navigation, augmented reality systems, semantic search, and medical imaging, are on the rise, requiring more accurate and efficient segmentation mechanisms. In recent years, deep learning approaches based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have emerged as the dominant paradigm for solving many problems in computer vision and machine learning. The main focus of this thesis is to investigate robust approaches that can tackle challenging semantic labeling tasks, including semantic instance segmentation and scene understanding. In the first approach, we convert the classic variational Level Set method to a learnable deep framework by proposing a novel definition of contour evolution named Recurrent Level Set (RLS). The proposed RLS employs Gated Recurrent Units to solve the energy minimization of a variational Level Set functional. The curve deformation process in RLS is formulated as a hidden state evolution procedure and is updated by minimizing an energy functional composed of fitting forces and contour length. We show that by sharing the convolutional features in a fully end-to-end trainable framework, RLS can be extended to Contextual Recurrent Level Set (CRLS) networks to address the problem of semantic segmentation in the wild. The experimental results show that our proposed RLS improves both computational time and segmentation accuracy over classic variational Level Set-based methods, while the fully end-to-end CRLS system achieves competitive performance against state-of-the-art semantic segmentation approaches on the PASCAL VOC 2012 and MS COCO 2014 databases. The second proposed approach, Contextual Recurrent Residual Networks (CRRN), inherits the merits of both sequence learning and residual learning in order to simultaneously model long-range contextual information and learn a powerful visual representation within a single deep network. Our proposed CRRN deep network consists of three parts corresponding to sequential input data, sequential output data, and hidden state, as in a recurrent network. Each unit in the hidden state is designed as a combination of two components: a context-based component via sequence learning and a visual-based component via residual learning. That is, each hidden unit in our proposed CRRN simultaneously (1) learns long-range contextual dependencies via the context-based component, modeling the relationship between the current unit and previous units as sequential information under an undirected cyclic graph (UCG), and (2) provides a powerful encoded visual representation via the residual component, which contains blocks of convolution and/or batch normalization layers equipped with identity skip connections. Furthermore, unlike previous scene labeling approaches [1, 2, 3], our method not only exploits long-range context and visual representation but is also formed as a fully end-to-end trainable system that effectively leads to the optimal model. In contrast to other existing deep learning networks, which are based on pretrained models, our fully end-to-end CRRN is completely trained from scratch. The experiments are conducted on four challenging scene labeling datasets, i.e., SiftFlow, CamVid, Stanford Background, and SUN, and compared against various state-of-the-art scene labeling methods.
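For reference, the "energy functional composed of fitting forces and contour length" that RLS minimizes matches the classic Chan–Vese form, and RLS replaces its gradient-descent evolution with a gated recurrent update. The sketch below states both; the exact gating used in RLS may differ from this standard GRU form.

```latex
% Chan--Vese-style energy: contour length plus inside/outside fitting forces.
E(\phi, c_1, c_2) = \mu \int_\Omega \delta(\phi)\,|\nabla \phi|\,dx
  + \lambda_1 \int_\Omega |I - c_1|^2 H(\phi)\,dx
  + \lambda_2 \int_\Omega |I - c_2|^2 \bigl(1 - H(\phi)\bigr)\,dx

% GRU-style evolution with the level set \phi_t playing the hidden-state role:
z_t = \sigma\!\left(W_z [\phi_{t-1}, x]\right), \qquad
r_t = \sigma\!\left(W_r [\phi_{t-1}, x]\right),
\tilde{\phi}_t = \tanh\!\left(W_h [r_t \odot \phi_{t-1}, x]\right), \qquad
\phi_t = (1 - z_t) \odot \phi_{t-1} + z_t \odot \tilde{\phi}_t
```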
4. Improving Automatic Image Annotation Using Metadata. Wahlquist, Gustav, January 2021.
Detecting and outlining products in images is beneficial for many use cases in e-commerce, such as automatically identifying and locating products within images and proposing matches for the detections. This study investigated how metadata associated with product images could boost the performance of an existing approach, with the ultimate goal of reducing the manual labour needed to annotate images. The thesis first explored whether approximate pseudo-masks could be generated for products by leveraging metadata as image-level labels and subsequently using the masks to train a Mask R-CNN; this approach did not yield satisfactory results. However, incorporating the metadata directly into the Mask R-CNN achieved an mAP increase of nearly 5%. Furthermore, using the available metadata to divide the training samples for a KNN model into subsets increased top-3 accuracy by up to 16%. By representing the data with embeddings created by a pre-trained CNN, the KNN model performed better, with both higher accuracy and more reasonable suggestions.
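As an illustration of the metadata-subset retrieval idea, the sketch below keeps one nearest-neighbour index per metadata value, so a query is only compared against products sharing its metadata. The model, distance metric, and k are assumptions, not values from the thesis.

```python
# A minimal sketch of KNN over pre-trained CNN embeddings, split by metadata.
import numpy as np
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors

class MetadataKNN:
    def __init__(self, k=3):
        self.k = k
        self.indexes = {}   # one KNN index per metadata subset
        self.labels = {}

    def fit(self, embeddings, labels, metadata):
        by_subset = defaultdict(list)
        for i, meta in enumerate(metadata):
            by_subset[meta].append(i)
        for meta, idx in by_subset.items():
            nn = NearestNeighbors(n_neighbors=min(self.k, len(idx)), metric="cosine")
            nn.fit(embeddings[idx])
            self.indexes[meta] = nn
            self.labels[meta] = [labels[i] for i in idx]

    def query(self, embedding, meta):
        # Search only the subset that shares the query image's metadata,
        # shrinking the candidate pool and boosting top-k accuracy.
        _, neighbors = self.indexes[meta].kneighbors(embedding[None, :])
        return [self.labels[meta][j] for j in neighbors[0]]
```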
5. Indoor 3D Scene Understanding Using Depth Sensors. Lahoud, Jean, 09 1900.
One of the main goals in computer vision is to achieve a human-like understanding of images. Nevertheless, image understanding has been studied mainly in the 2D image frame, so more information is needed to relate images to the 3D world. With the emergence of 3D sensors (e.g. the Microsoft Kinect), which provide depth along with color information, the task of propagating 2D knowledge into 3D becomes more attainable and enables interaction between a machine (e.g. a robot) and its environment. This dissertation focuses on three aspects of indoor 3D scene understanding: (1) 2D-driven 3D object detection for single-frame scenes with inherent 2D information, (2) 3D object instance segmentation for 3D reconstructed scenes, and (3) using room and floor orientation for automatic labeling of indoor scenes, which could be used for self-supervised object segmentation. These methods capture the physical extents of 3D objects, such as their sizes and actual locations within a scene.
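As an illustration of propagating a 2D detection into 3D with a depth sensor, the sketch below back-projects the pixels inside a 2D box through the pinhole camera model. The intrinsics shown are generic Kinect-like values, not ones from the dissertation.

```python
# A minimal sketch of 2D-driven lifting: a detected box region becomes a
# 3D point cloud via pinhole back-projection of the depth map.
import numpy as np

def backproject_box(depth, box, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """depth: (H, W) depth map in meters; box: (x1, y1, x2, y2) integer 2D box."""
    x1, y1, x2, y2 = box
    us, vs = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    z = depth[vs, us]
    valid = z > 0                          # ignore missing depth readings
    x = (us[valid] - cx) * z[valid] / fx   # pinhole back-projection
    y = (vs[valid] - cy) * z[valid] / fy
    return np.stack([x, y, z[valid]], axis=1)  # (N, 3) points in camera frame

# The physical extent of the object then falls out of the points, e.g.:
# points = backproject_box(depth, (100, 80, 220, 200))
# size = points.max(axis=0) - points.min(axis=0)
```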
6. Exploration of performance evaluation metrics with deep-learning-based generic object detection for robot guidance systems. Gustafsson, Helena, January 2023.
Robots are often used within industry for automated tasks that are too dangerous, complex, or strenuous for humans, which leads to time and cost benefits. A robot can have an arm and a gripper to manipulate the world, and sensors as eyes to perceive it. Human vision can seem effortless, but machine vision requires substantial computation to approach human effectiveness. Visual object recognition is a common goal for machine vision, often addressed with deep learning and generic object detection. This thesis focuses on robot guidance systems comprising a robot with a gripper on its arm, a camera that acquires images of the world, boxes to detect in one or more layers, and software that applies a generic object detection model to detect the boxes. The performance of robot guidance systems is affected by many variables, including environmental, camera, object, and robot gripper aspects. A survey was constructed to gather feedback from professionals on which thresholds a detection from the model must meet to count as correct, with respect to whether the detected object can actually be picked up by a robot. This thesis implements precision, recall, average precision at a specific threshold, average precision over a range of thresholds, localization-recall-precision error, and a manually constructed score based on the survey results for the robot's ability to pick up an object from the information provided by a detection, called the pickability score. These metrics are implemented within a tool intended for analyzing the performance of different models on varying datasets. The values of all the metrics for the applied dataset are presented in the results, and the metrics are discussed with regard to what information they convey in a robot guidance system. The conclusion is to use each metric for what it does best: the average precision metrics for evaluating model performance, and the pickability scores with extended features for evaluating robot gripper pickability.
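For reference, the precision and recall metrics above are built on IoU matching of detections to ground-truth boxes at a threshold. The sketch below shows one common greedy-matching convention, which may differ in detail from the thesis's tool.

```python
# A minimal sketch of threshold-based detection matching: greedy IoU matching,
# then precision and recall over the matched detections.
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(detections, truths, thresh=0.5):
    """detections sorted by confidence; each truth is matched at most once."""
    matched, tp = set(), 0
    for det in detections:
        best, best_iou = None, thresh
        for i, gt in enumerate(truths):
            if i not in matched and iou(det, gt) >= best_iou:
                best, best_iou = i, iou(det, gt)
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(detections) - tp
    fn = len(truths) - tp
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)
```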
7. Computer Vision Approaches for Mapping Gene Expression onto Lineage Trees. Lalit, Manan, 06 December 2022.
This project concerns the early development of living organisms, a period accompanied by dynamic morphogenetic events: the number of cells increases, cells change shape, and cell fates are specified. To capture these dynamic morphological changes, one can employ a form of microscopy imaging such as Selective Plane Illumination Microscopy (SPIM), which offers single-cell resolution across time and hence allows observing the positions, velocities, and trajectories of most cells in a developing embryo. Unfortunately, the dynamic genetic activity that underlies these morphological changes and influences cellular fate decisions is captured only as static snapshots and often requires processing (sequencing or imaging) multiple distinct individuals. To set the stage for characterizing the factors that influence cellular fate, the data arising from these static snapshots of multiple individuals must be brought into the same frame of reference as the SPIM data of other distinct individuals that characterizes the changes in morphology.
In this project, a computational pipeline is established that maps data from these various imaging modalities and specimens to a canonical frame of reference. The pipeline relies on three core building blocks: instance segmentation, tracking, and registration. In this dissertation, I introduce EmbedSeg, my solution for instance segmentation of 2D and 3D (volume) image data; LineageTracer, my solution for tracking time-lapse (2D+t, 3D+t) recordings; and PlatyMatch, my solution for registering volumes. Errors from these building blocks accumulate, producing noisy estimates of gene expression for the digitized cells in the canonical frame of reference. These noisy estimates are processed to infer the underlying hidden state using a Hidden Markov Model (HMM) formulation. Lastly, wider dissemination of these methods requires an effective visualization strategy, and the approach employed is also discussed in this dissertation.
The pipeline was designed with imaging volume data in mind but can easily be extended to incorporate other data modalities, if available, such as single-cell RNA sequencing (scRNA-Seq) (more details are provided in the Discussion chapter). The methods elucidated in this dissertation provide a fertile playground for future experiments and analyses; some of these potential experiments, along with current weaknesses of the computational pipeline, are also discussed in the Discussion chapter.
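As an illustration of the HMM smoothing step, the sketch below Viterbi-decodes a noisy on/off expression readout along one lineage branch. The two-state model and every probability in it are illustrative assumptions, not values from the dissertation.

```python
# A minimal sketch of HMM denoising: infer the most likely hidden state
# sequence behind noisy per-cell gene-expression observations (Viterbi).
import numpy as np

def viterbi(observations, trans, emit, prior):
    """observations: (T,) discrete symbols; trans: (S, S); emit: (S, K); prior: (S,)"""
    T, S = len(observations), len(prior)
    logp = np.log(prior) + np.log(emit[:, observations[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(trans)   # (S, S): previous -> current
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0) + np.log(emit[:, observations[t]])
    states = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack the best path
        states.append(int(back[t][states[-1]]))
    return states[::-1]

# Example: hidden states {0: off, 1: on}; observed symbols {0: low, 1: high}.
trans = np.array([[0.9, 0.1], [0.1, 0.9]])   # states tend to persist in time
emit = np.array([[0.8, 0.2], [0.3, 0.7]])    # noisy readout of the true state
obs = np.array([0, 0, 1, 0, 1, 1, 1])
print(viterbi(obs, trans, emit, prior=np.array([0.5, 0.5])))
```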
8. Enhancing Athletic Training Through AI: A Comparative Analysis Of YOLO Versions For Image Segmentation In Velocity-Based Training. Ågren, Oscar; Palm, Johan, January 2024.
This work explores the application of Artificial Intelligence (AI) in sports, specifically comparing You Only Look Once (YOLO) version 8 and version 9 models in the context of Velocity-Based Training (VBT) and resistance training. It aims to evaluate the models' performance in instance segmentation and their effectiveness in estimating velocity metrics. Additionally, methods for pixel-to-meter conversion and centroid selection on barbells are developed and discussed. The field of AI is growing rapidly, with great practical possibilities in the sports industry. Traditional methods of collecting and analyzing data with sensors are often expensive and unavailable to many coaches and athletes; by leveraging AI techniques, this work aims to provide insights into more cost-effective solutions. An experiment was conducted in which YOLOv8 and YOLOv9 models of different sizes were trained on a custom dataset. Using the resulting model weights, key VBT metrics were extracted from videos of squat, bench press, and deadlift exercises and compared with sensor data. To automatically track the barbell in the videos, the centroids of bounding boxes were used. Additionally, to express velocity in meters per second, pixel-to-meter conversion ratios were obtained using the Circular Hough Transform. The findings indicate that the YOLOv8x model generally excels on performance metrics, although it records a high mean inference time. The YOLOv8m model overestimated mean velocity, peak velocity, and range of motion, highlighting potential challenges for real-time VBT applications. Otherwise, all models performed very similarly to the sensor data, occasionally differing in scale due to faulty pixel-to-meter conversions. In conclusion, this work underscores AI's potential in the sports industry while identifying areas for further enhancement to ensure accuracy and reliability in applications.
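As an illustration of the pixel-to-meter conversion and centroid-based velocity estimation, the sketch below finds the weight plate with OpenCV's Circular Hough Transform and scales bar-path displacement by the plate's known diameter. The 0.45 m plate diameter and all detector parameters are assumptions, not values from the thesis.

```python
# A minimal sketch: Circular Hough Transform for the pixel-to-meter ratio,
# then mean bar velocity from per-frame box-centroid displacement.
import cv2
import numpy as np

PLATE_DIAMETER_M = 0.45  # standard competition plate (assumed reference)

def meters_per_pixel(frame_gray):
    """frame_gray: single-channel uint8 frame containing the weight plate."""
    circles = cv2.HoughCircles(
        frame_gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=100,
        param1=100, param2=60, minRadius=40, maxRadius=300)
    if circles is None:
        return None
    _, _, radius = circles[0, 0]         # strongest circle: the plate
    return PLATE_DIAMETER_M / (2 * radius)

def mean_velocity(centroids_px, ratio_m_per_px, fps):
    """centroids_px: list of (x, y) barbell box centroids, one per frame."""
    ys = np.array([c[1] for c in centroids_px], dtype=float)
    displacement_m = np.abs(np.diff(ys)) * ratio_m_per_px   # vertical bar path
    return displacement_m.sum() / (len(ys) - 1) * fps       # meters per second
```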
9. Salient object detection and segmentation in videos. Wang, Qiong, 09 May 2019.
This thesis focuses on video salient object detection and video object instance segmentation, which aim to detect the most attracting objects in a video sequence or to assign consistent object IDs to each of its pixels. One approach, one overview, and one extended model are proposed for video salient object detection, and one approach is proposed for video object instance segmentation. For video salient object detection, we propose: (1) a traditional approach that detects the whole salient object via the addition of virtual borders; a guided filter is applied on the temporal output to integrate spatial edge information for better detection of the salient object's edges, and a global spatio-temporal saliency map is obtained by combining the spatial saliency map and the temporal saliency map according to their entropy. (2) An overview of recent developments in deep-learning-based methods, including a classification of the state-of-the-art methods and their frameworks, and an experimental comparison of their performances. (3) An extended model that further improves the performance of the proposed traditional approach by integrating a deep-learning-based image salient object detection method. For video object instance segmentation, we propose a deep-learning approach in which a warping confidence computation first judges the confidence of the warped mask map, and a semantic selection then optimizes the warped map, re-identifying the object using the semantic labels of the target object. The proposed approaches have been assessed on published large-scale and challenging datasets, and the experimental results show that they outperform the state-of-the-art methods.
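As an illustration of the entropy-based combination of the spatial and temporal saliency maps, the sketch below weights each map by the inverse of its entropy, so the more decisive map dominates. This particular weighting is an assumed reading of the abstract, not the thesis's exact formula.

```python
# A minimal sketch of entropy-weighted fusion of two saliency maps.
import numpy as np

def entropy(saliency, bins=64):
    """Shannon entropy of a saliency map with values normalized to [0, 1]."""
    hist, _ = np.histogram(saliency, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def fuse(spatial, temporal):
    # A low-entropy map is more "decisive" about where the object is,
    # so it receives the larger weight (assumed weighting scheme).
    ws = 1.0 / (entropy(spatial) + 1e-6)
    wt = 1.0 / (entropy(temporal) + 1e-6)
    return (ws * spatial + wt * temporal) / (ws + wt)
```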
10. Automatic classification of fish and bubbles at pixel-level precision in multi-frequency acoustic echograms using U-Net convolutional neural networks. Slonimer, Alex, 05 April 2022.
Multi-frequency backscatter acoustic profilers (echosounders) are used to measure biological and physical phenomena in the ocean in ways that are not possible with optical methods. Echosounders are commonly used on ocean observatories and by commercial fisheries but require significant manual effort to classify species of interest within the collected echograms. The work presented in this thesis tackles the challenging task of automating the identification of fish and other phenomena in echosounder data, with specific application to aggregations of juvenile salmon, schools of herring, and bubbles of air that have been mixed into the water.
U-Net convolutional neural networks (CNNs) are used to accomplish this task by identifying classes at the pixel level. The data considered here were collected in Okisollo Channel on the coast of British Columbia, Canada, using an Acoustic Zooplankton and Fish Profiler at four frequencies (67.5, 125, 200, and 455 kHz). The entrainment of air bubbles and the behaviour of fish are both governed by the surrounding physical environment, so to improve the classification, simulated channels for water depth and solar elevation angle (a proxy for sunlight) are used to encode the CNNs with environmental information, providing spatial and temporal context. Manually annotating echograms at the pixel level is challenging, and a custom application was developed to aid the process; a relatively small set of annotations was created and used to train the CNNs. During training, the echogram data are divided into randomly spaced square tiles to encode the models with robust features, and during classification into overlapping tiles for added redundancy. This is done without removing noise from the data, ensuring broad applicability. The approach proves highly successful, as evidenced by the best-performing U-Net model producing F1 scores of 93.0%, 87.3%, and 86.5% for the herring, salmon, and bubble classes, respectively. These models also achieve promising results when applied to echogram data with coarser resolution.
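As an illustration of the overlapping-tile inference, the sketch below averages per-pixel class probabilities wherever tiles overlap before taking the argmax. Tile size, stride, and class count are assumed values, and the echogram dimensions are assumed to be multiples of the stride for full coverage.

```python
# A minimal sketch of overlapping-tile classification with averaged predictions.
import numpy as np

def classify_with_overlap(echogram, model, tile=128, stride=64, n_classes=4):
    """echogram: (H, W, C) multi-frequency input;
    model(patch) -> (tile, tile, n_classes) per-pixel class probabilities."""
    H, W = echogram.shape[:2]
    probs = np.zeros((H, W, n_classes))
    counts = np.zeros((H, W, 1))
    for y in range(0, H - tile + 1, stride):
        for x in range(0, W - tile + 1, stride):
            pred = model(echogram[y:y + tile, x:x + tile])
            probs[y:y + tile, x:x + tile] += pred
            counts[y:y + tile, x:x + tile] += 1   # redundancy from overlaps
    return (probs / np.maximum(counts, 1)).argmax(axis=-1)  # per-pixel class map
```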
One goal in fisheries acoustics is to detect distinct schools of fish. Following the initial pixel-level classification, the results from the best-performing U-Net model are fed through a heuristic module, inspired by traditional fisheries methods, that links connected components of identified fish (school candidates) into distinct school objects. The results are compared with the outputs of a recent study that relied on a Mask R-CNN architecture to apply instance segmentation for classifying fish schools. The U-Net/heuristic hybrid technique improves on the Mask R-CNN approach by a small amount for the classification of herring schools, and by a large amount for aggregations of juvenile salmon (an improvement in mean average precision from 24.7% to 56.1%). / Graduate
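As an illustration of the heuristic linking module, the sketch below turns connected components of fish-labeled pixels into school objects, bridging small gaps and discarding tiny fragments. The gap and size thresholds are illustrative, not the thesis's values.

```python
# A minimal sketch of linking connected components into distinct school objects.
import numpy as np
from scipy import ndimage

def extract_schools(class_map, fish_class=1, min_pixels=25, max_gap=5):
    mask = class_map == fish_class
    # Dilation bridges small gaps so nearby candidates join into one school,
    # mimicking linking rules from traditional fisheries acoustics.
    bridged = ndimage.binary_dilation(mask, iterations=max_gap)
    labels, n = ndimage.label(bridged)
    schools = []
    for i in range(1, n + 1):
        component = (labels == i) & mask      # restore the original extent
        if component.sum() >= min_pixels:     # discard tiny fragments
            schools.append(component)
    return schools
```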