A comparison of image and object level annotation performance of image recognition cloud services and custom Convolutional Neural Network models

Recent advancements in machine learning have contributed to an explosive growth of the image recognition field. Simultaneously, multiple Information Technology (IT) service providers such as Google and Amazon have embraced cloud solutions and software as a service. These factors have helped mature many computer vision tasks from scientific curiosities into practical applications. As image recognition is now accessible to the general developer community, a need arises for a comparison of its capabilities and of what can be gained from choosing a cloud service over a custom implementation. This thesis empirically studies the performance of five general image recognition services (Google Cloud Vision, Microsoft Computer Vision, IBM Watson, Clarifai and Amazon Rekognition) and of custom Convolutional Neural Network (CNN) models that we configured and trained ourselves. Image- and object-level annotations of images extracted from different datasets were tested, both in their original state and after being subjected to one of six types of distortion: brightness, color, compression, contrast, blurriness and rotation. The output labels and confidence scores were compared to ground truth at multiple levels of abstraction, such as food, soup and clam chowder. The results show that among the services tested there is currently no clear top performer across all categories; they all exhibit variations and similarities in their output, but on average Google Cloud Vision performs best by a small margin. All services are adept at identifying high-level concepts such as food and most mid-level ones such as soup. At the most specific level, such as clam chowder, they start to diverge, some performing better than others in different categories. On the chosen dataset, Amazon Rekognition was found to be the most capable at identifying multiple unique objects within the same image.
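The distortion methodology above can be sketched in plain Python. This is an illustrative reconstruction, not the thesis's actual pipeline: the function names are hypothetical, and a grayscale image is modeled as a list of rows of 0-255 pixel values so the sketch stays self-contained.

```python
# A few of the six distortion types (brightness, contrast, blurriness,
# rotation) as simple, self-contained operations on a grayscale image
# represented as a list of rows of 0-255 integers. Illustrative only.

def adjust_brightness(img, offset):
    """Shift every pixel by a constant, clamped to the 0-255 range."""
    return [[max(0, min(255, p + offset)) for p in row] for row in img]

def adjust_contrast(img, factor):
    """Scale each pixel's deviation from the image mean by a factor."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [[max(0, min(255, round(mean + factor * (p - mean))))
             for p in row] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def box_blur(img):
    """3x3 box blur as a simple stand-in for a blurriness distortion."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbours = [img[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(round(sum(neighbours) / len(neighbours)))
        out.append(row)
    return out
```

In an evaluation like the one described, each distorted copy would be re-submitted to every service and its labels compared against those returned for the undistorted original.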
Additionally, it was found that accepting synonyms of the ground truth labels increased measured performance, as this narrowed the semantic gap between our expectations and the actual output of the services. All services showed vulnerability to image distortions, especially compression, blurriness and rotation. The custom models performed noticeably worse, roughly half as well as the cloud services, possibly due to the difference in training data standards. The best model, configured with three convolutional layers, 128 nodes and a layer density of two, reached an average performance of almost 0.2, or 20%. In conclusion, for those limited by a lack of machine learning experience, computational resources or time, using one of the cloud services is recommended to reach an acceptable performance level. Which one to choose depends on the intended application, as the services perform differently across categories. All of the services are vulnerable to multiple image distortions, potentially enabling adversarial attacks. Finally, there remains considerable room for improvement, both in the performance of these services and in the computer vision field as a whole.
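The synonym-expanded matching described above can be illustrated with a short sketch. This is a minimal reconstruction under assumptions: the synonym table and the per-level scoring rule are hypothetical stand-ins, not the thesis's actual evaluation code.

```python
# Illustrative synonym-expanded matching of service output against
# ground truth at three concept levels (high / mid / specific).
# The synonym lists and scoring rule are hypothetical examples.

# Ground truth for one image, at three levels of abstraction.
GROUND_TRUTH = {"high": "food", "mid": "soup", "specific": "clam chowder"}

# Hand-picked synonyms that narrow the semantic gap between the
# expected label and a service's actual output vocabulary.
SYNONYMS = {
    "food": {"food", "dish", "meal", "cuisine"},
    "soup": {"soup", "broth", "stew"},
    "clam chowder": {"clam chowder", "chowder"},
}

def match_score(service_labels, ground_truth=GROUND_TRUTH):
    """Fraction of concept levels hit, counting any synonym as a hit.

    service_labels: mapping of label -> confidence score, shaped like
    the label annotations a cloud vision API returns.
    """
    labels = {label.lower() for label in service_labels}
    hits = 0
    for concept in ground_truth.values():
        accepted = SYNONYMS.get(concept, {concept})
        if labels & accepted:
            hits += 1
    return hits / len(ground_truth)
```

For example, a service returning `{"meal": 0.97, "broth": 0.84}` would hit the high and mid levels via synonyms but miss "clam chowder", scoring 2/3, whereas exact-string matching alone would have scored 0.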

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:bth-18074
Date: January 2019
Creators: Nilsson, Kristian; Jönsson, Hans-Eric
Publisher: Blekinge Tekniska Högskola, Institutionen för programvaruteknik
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess