• Research Brief
  • Learning in Machines & Brains

A machine learning system generates captions for images from scratch

by CIFAR Feb 11 / 16

Caption generation is a fundamental problem in artificial intelligence because it targets a distinguishing feature of human intelligence: our ability to construct descriptions that other people can readily understand.


The research aims to incorporate attention into a machine learning system so that it can automatically generate captions for images from scratch, rather than relying on object detection systems.


A disproportionately large portion of our brains is devoted to visual processing. Caption generation is an important challenge for machine learning algorithms as computers must mimic the remarkable human ability to compress huge amounts of visual information into descriptive language. Not only must caption generation models be powerful enough to solve the computer vision challenges of determining which objects are in an image, but they must also be able to capture and express their relationships in a natural language. 

A recent surge of work in this area, particularly in training neural networks and building large classification datasets, has significantly improved the quality of caption generation. Encouraged by these advances and by recent success in using attention in machine translation and object recognition, this study investigates models that can identify the most important parts of an image while generating its caption. Instead of compressing an entire image into a static representation, attention allows key features to come dynamically to the forefront as needed. This is especially useful for cluttered images.


An attention-based model generates more accurate and descriptive captions for images than models that try to describe an entire image at once. The model can select an area to fix its attention on, examine the area to see what is in it, describe it, and then choose the next region. This is similar to how humans parse an image, finding the important regions one at a time and piecing together the whole scene. The attention-based approach achieved state-of-the-art performance on three benchmark datasets: Flickr8k with 8,000 images, Flickr30k with 30,000 images and MS COCO with 82,783 images.

The model can be trained to function in a manner similar to human intuition. The model is capable of learning as it goes along, in a sense, generating its next word based on what it knows about the words that came before. Unlike other models, it does not explicitly use object detectors. This makes it more flexible, allowing it to go beyond “objectness” and learn to attend to abstract concepts.

The attention mechanism can enhance understanding of how the network is making its decisions. Models that incorporate an attention mechanism make it possible to visualize what the network “sees”, including exactly “where” and “what” it is focusing on. Seeing how the model makes its decisions, and why it may be making mistakes, allows researchers to fine-tune it, which further improves the quality of the captions generated.


The model combines convolutional neural networks, which extract vector representations of images, with recurrent neural networks, which decode those representations into natural language sentences. Using recurrent neural networks for machine translation was an approach previously developed by CIFAR Senior Fellow Yoshua Bengio. Instead of translating from one language to another, the focus here is on translating images into words. To incorporate the attention mechanism, the research team trained two attention-based image caption generators under a common framework:

• a “soft” deterministic attention mechanism trainable by standard back-propagation methods; and
• a “hard” stochastic, or probabilistic, attention mechanism trainable by maximizing an approximate variational lower bound or by the REINFORCE learning rule.
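
The difference between the two mechanisms can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy feature vectors and the dot-product scoring are all assumptions made for brevity, since the brief does not say how image locations are scored. The “soft” version blends every location into one weighted average, which is why it can be trained by back-propagation, while the “hard” version samples a single location from the same weights.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, query):
    """Score each image location against the decoder state.

    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(a * b for a, b in zip(f, query)) for f in features]
    return softmax(scores)


def soft_attention(features, query):
    """Deterministic: the context is the weighted average of all locations."""
    weights = attention_weights(features, query)
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def hard_attention(features, query, rng):
    """Stochastic: sample one location, with probability equal to its weight."""
    weights = attention_weights(features, query)
    index = rng.choices(range(len(features)), weights=weights, k=1)[0]
    return features[index], index


# Hypothetical toy input: three image locations, two features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # stands in for the decoder's current state

context, weights = soft_attention(features, query)
picked, index = hard_attention(features, query, random.Random(0))
```

Because the hard version involves a sampling step, gradients cannot flow through it directly, which is consistent with the brief's note that it is trained with a variational lower bound or the REINFORCE rule instead.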

The model was tested using two common metrics in the caption generation literature, BLEU (Bilingual Evaluation Understudy) and METEOR (Metric for Evaluation of Translation with Explicit Ordering), on three datasets: Flickr8k with 8,000 images, Flickr30k with 30,000 images and MS COCO with 82,783 images.
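
To give a sense of what a BLEU-style score measures, here is a simplified sketch for one candidate caption and one reference caption. It is an illustration only: the function names and example sentences are hypothetical, and published BLEU scores are computed over a whole test set, usually with several reference captions per image. The sketch combines clipped n-gram precision (how many word sequences in the candidate also appear in the reference) with a brevity penalty for captions that are too short.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    if len(candidate) == 0:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1.0 - len(reference) / len(candidate))


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of the 1..max_n precisions times the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


# Hypothetical captions, split into words.
reference = "a dog runs on the grass".split()
candidate = "a dog runs on grass".split()
score = simple_bleu(candidate, reference)
```

A caption identical to the reference scores 1.0, and a caption sharing no words with it scores 0.0. METEOR additionally accounts for matches such as stems and synonyms and is not sketched here.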


The approach taken by this model toward images could be extended to videos. It may also be useful for helping blind people learn about their surroundings. It surpasses past research in that not only can it classify images quickly and with a high degree of accuracy, it can also describe them in much richer and more descriptive language. This has powerful implications for companies such as Facebook and Google, which need to manage millions of images. Search engines such as Google do offer image searches, but such engines currently do not actually search image content. Instead, they use caption information and textual context to generate search results.


University of Toronto: Richard S. Zemel (CIFAR Senior Fellow), Jimmy Lei Ba, Ryan Kiros, Ruslan Salakhutdinov (CIFAR Fellow)

Université de Montréal: Yoshua Bengio (CIFAR Senior Fellow), Kelvin Xu, Kyunghyun Cho, Aaron Courville


Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” arXiv preprint arXiv:1502.03044 (2015).
