Computer model generates automatic captions for images

by Lindsay Jolivet | Learning in Machines & Brains News | 31.03.2015
Examples of when the computer model generated an incorrect caption (above) and when it correctly identified the objects (below).
Image courtesy of CIFAR Senior Fellow Richard Zemel

CIFAR fellows have created a machine learning system that generates captions for images from scratch, scanning scenes and putting together a sentence to describe what it sees.

Caption generation is an example of a fundamental problem of artificial intelligence, one that draws on a distinguishing feature of human intelligence: our ability to make sense of our environment and construct descriptions that other people can readily understand, according to Richard Zemel (University of Toronto), a CIFAR Senior Fellow in the Learning in Machines & Brains program (formerly known as Neural Computation & Adaptive Perception, or NCAP) and a co-author on the paper.

The ability to generate captions automatically has implications for companies such as Facebook and Google, which need to manage millions of images, but Zemel says it could also be useful for helping blind people learn about their surroundings.

Other research in this area has taught computers to describe scenes by matching an image to the correct sentence from a predetermined set, or giving it a sentence and teaching it to pull images from the Internet that match. “Generating captions from scratch is harder,” Zemel says.

The new technique uses an approach for translating languages developed by CIFAR Senior Fellow Yoshua Bengio (University of Montreal) and applies it to a more difficult kind of translation — from images to words. “Instead of it being in French, now it’s in images,” says Zemel.

Along with a team from the University of Toronto consisting of Ryan Kiros, Jimmy Ba and Ruslan Salakhutdinov, also an NCAP fellow, and the Université de Montréal's Kelvin Xu, Kyunghyun Cho, Aaron Courville and Bengio, Zemel developed a model that stands out because it selects an area to fix its attention on, examines that area to identify what is in it, describes it, and then chooses the next region. This is similar to how humans parse an image, finding the important regions one at a time and piecing together the whole scene.
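The region-by-region focus described above can be sketched with a soft-attention step: score each image region for relevance, normalize the scores into weights, and concentrate on the highest-weighted region. This is a minimal toy illustration, not the authors' published model; the region scores here are invented stand-ins for what a trained network would compute from image features.

```python
import math

def softmax(scores):
    """Normalize raw relevance scores into attention weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for four image regions at one time step
region_scores = [0.2, 2.5, 0.1, 1.0]
weights = softmax(region_scores)

# The model "fixes its attention" on the most heavily weighted region,
# describes it, then repeats the process for the next word
focus = max(range(len(weights)), key=lambda i: weights[i])
```

In the real system the weights are recomputed at every step, so the model's gaze shifts across the image as the sentence is generated.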

“People have always wanted to put attention in models for two reasons,” Zemel says. “We know humans use selective attention, so one aim is to construct models that embody our understanding of how this works. But also, you want to show that there’s some computational advantage to doing it.”

And there is: their new model works better than those that try to describe an entire image at once. It also learns as it goes along, in a sense, generating its next word based on what it knows about the words that came before. For example, if the computer scans one region of an image and generates the word "boat," it is much more likely to generate a word such as "water" later in the sentence than, say, "cat," because it understands that "water" and "boat" appear together much more frequently in language.
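That word-by-word conditioning can be illustrated with a toy example. This is not the published model; the co-occurrence counts below are invented purely to show the idea that "water" becomes far more probable than "cat" once "boat" has been generated.

```python
# Made-up co-occurrence counts standing in for the statistics a real
# caption model would learn from large amounts of text
cooccurrence = {
    "boat": {"water": 8, "sail": 3, "cat": 0},
}

def next_word_probs(prev_word):
    """Turn raw co-occurrence counts into a probability distribution
    over candidate next words."""
    counts = cooccurrence[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs("boat")
# After "boat", the distribution favors "water" over "cat"
```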

The model goes beyond previous research on what is called classification, which involves training computers to recognize particular kinds of objects – such as cats, in the case of Google Brain's 2012 achievement. "The NCAP program has been very successful at doing classification," Zemel says. Fellows have won many contests with models that classify images quickly and with a high degree of accuracy.

“To go beyond that we really want to understand what’s in an image. Not just say that there’s a dog in there, but we’d like to be able to describe the whole scene,” Zemel says. One of the next steps is to extend this approach to describe videos, too.
