CIFAR fellows have created a machine learning system that generates captions for images from scratch, scanning scenes and putting together a sentence to describe what it sees.
Examples of when the computer model generated an incorrect caption (above) and when it correctly identified the objects (below). Image courtesy of CIFAR Senior Fellow Richard Zemel
Caption generation is an example of a fundamental problem in artificial intelligence, one that touches on what distinguishes human intelligence: our ability to make sense of our environment and to construct descriptions that other people can readily understand, according to Richard Zemel (University of Toronto), a CIFAR Senior Fellow in the Learning in Machines & Brains program (formerly known as Neural Computation & Adaptive Perception, or NCAP) and a co-author on the paper.
The ability to generate captions automatically has implications for companies such as Facebook and Google, which need to manage millions of images, but Zemel says it could also help blind people learn about their surroundings.
Other research in this area has taught computers to describe scenes by matching an image to the correct sentence from a predetermined set, or by giving them a sentence and teaching them to pull matching images from the Internet. “Generating captions from scratch is harder,” Zemel says.
The new technique uses an approach for translating languages developed by CIFAR Senior Fellow Yoshua Bengio (University of Montreal) and applies it to a more difficult kind of translation — from images to words. “Instead of it being in French, now it’s in images,” says Zemel.
Zemel developed the model with a team from the University of Toronto consisting of Ryan Kiros, Jimmy Ba and Ruslan Salakhutdinov, also an NCAP fellow, and Université de Montreal’s Kelvin Xu, Kyunghyun Cho, Aaron Courville and Bengio. The model is special because it can select an area to fix its attention on, examine the area to see what is in it, describe it, and then choose the next region. This is similar to how humans parse an image, finding the important regions, one at a time, and piecing together the whole scene.
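For readers who want a feel for how a model might "fix its attention" on one region, here is a minimal sketch of one common approach, often called soft attention. It is an illustration under stated assumptions, not the team's actual model: the region features, the query vector and the function names `softmax` and `attend` are all hypothetical, and a real system would learn these numbers from data rather than have them typed in by hand.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, query):
    """Score each region against the query, then blend regions by weight."""
    # Dot product: how relevant is each region to what we are looking for?
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Weighted average of region features: the "attended" summary.
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context

# Hypothetical two-number features for three image regions.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical vector standing in for "what the next word needs".
query = [4.0, 0.0]

weights, context = attend(regions, query)
focus = weights.index(max(weights))  # index of the most attended region
print("attention weights:", [round(w, 3) for w in weights])
print("most attended region:", focus)
```

In this toy setup the first region gets most of the weight because it lines up best with the query. Repeating the step with a new query for each word gives the region-by-region behaviour described above.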
“People have always wanted to put attention in models for two reasons,” Zemel says. “We know humans use selective attention, so one aim is to construct models that embody our understanding of how this works. But also, you want to show that there’s some computational advantage to doing it.”
And there is: their new model works better than those that try to describe an entire image at once. It also learns as it goes along, in a sense, generating its next word based on what it knows about the words that came before. For example, if the computer scans one region of an image and generates the word “boat,” it is much more likely to generate a word such as “water” later in the sentence than, say, “cat,” because it has learned that “water” and “boat” appear together much more frequently in language.
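The "boat" and "water" intuition can be shown with a toy calculation. The sketch below simply counts how often words share a caption in a tiny made-up corpus; it is a stand-in for illustration only, since the actual model learns these associations inside a neural network rather than by counting. The captions and the helper names `cooccurrence_counts` and `score_after` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus of captions (not data from the study).
captions = [
    "a boat on the water",
    "a boat floating on water",
    "a cat on the sofa",
    "a small boat near the water",
]

def cooccurrence_counts(sentences):
    """Count how many captions contain each ordered pair of distinct words."""
    counts = Counter()
    for sentence in sentences:
        words = set(sentence.split())
        for a in words:
            for b in words:
                if a != b:
                    counts[(a, b)] += 1
    return counts

def score_after(counts, seen_word, candidates):
    """Relative likelihood of each candidate given a word already generated."""
    # Add-one smoothing so unseen pairs keep a small, nonzero chance.
    raw = {c: counts[(seen_word, c)] + 1 for c in candidates}
    total = sum(raw.values())
    return {c: value / total for c, value in raw.items()}

counts = cooccurrence_counts(captions)
probs = score_after(counts, "boat", ["water", "cat"])
print(probs)
```

With this corpus, "water" comes out well ahead of "cat" once "boat" has been generated, which is the kind of preference the article describes.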
The model advances past research on what is called classification, which involves training computers to recognize similar objects – such as cats in the case of Google Brain’s 2012 achievement. “The NCAP program has been very successful at doing classification,” Zemel says. Fellows have won many contests with models that classify images quickly and with a high degree of accuracy.
“To go beyond that we really want to understand what’s in an image. Not just say that there’s a dog in there, but we’d like to be able to describe the whole scene,” Zemel says. One of the next steps is to extend this approach to describe videos, too.