Scientists have developed a new way for computers to see and understand human body pose in videos.
A computer program called MoDeep uses a new method to estimate the position of the human body.
Photo courtesy of Arjun Jain, Jonathan Tompson et al
The algorithm, developed by a group of scientists including Yann LeCun (Facebook and New York University), co-director of CIFAR’s Learning in Machines & Brains program (formerly known as Neural Computation & Adaptive Perception), could have applications in animation, human-computer interaction and virtual reality.
The approach, named MoDeep, uses convolutional neural networks, a technique LeCun invented 25 years ago. Convolutional neural networks let computer scientists train systems to classify images according to the objects they contain, such as body parts.
“The technique has come to the forefront in the last few years because the machines that we have now are really powerful enough to run this at a big scale and we have lots of data to train them,” LeCun says. “One of the big successes of the NCAP program over the last 10 years was to revive those techniques.”
In MoDeep, the convolutional neural network teaches the computer to recognize hands, elbows, shoulders and other body parts and then puts the pieces together to recognize the whole body. In addition, it can tell which body parts are moving, and in what direction. The work represents a major improvement in computers’ ability to identify the position of people’s body parts. The approach lets a computer see if a person is running with arms bent or standing with arms in the air, for instance.
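The "putting the pieces together" step can be pictured with a small sketch. This is not MoDeep's actual architecture; it is a hypothetical illustration in which the network has already produced one confidence map per body part, and the pose is assembled by taking each map's highest-scoring location:

```python
import numpy as np

# Hypothetical per-part "heatmaps": one 2-D score map per body part,
# where higher values mean more confidence the part is at that location.
rng = np.random.default_rng(0)
heatmaps = {part: rng.random((60, 90)) for part in ["hand", "elbow", "shoulder"]}

def assemble_pose(heatmaps):
    """Pick the highest-scoring location in each part's map and
    collect the (row, col) coordinates into a single pose estimate."""
    pose = {}
    for part, scores in heatmaps.items():
        idx = np.argmax(scores)                      # flat index of the peak
        pose[part] = np.unravel_index(idx, scores.shape)
    return pose

pose = assemble_pose(heatmaps)
print(pose)  # one (row, col) location per body part
```

A real system would also enforce spatial consistency between parts (an elbow should sit between a shoulder and a hand), but the idea of combining per-part detections into one body estimate is the same.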
LeCun compares the process of the convolutional net to a tiny window representing the network’s field of view that slides over an entire picture, scanning one small region at a time and detecting whether it has found the object it is looking for, such as a hand. The researchers repeat the process for every body part they want to detect.
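The sliding-window idea LeCun describes can be sketched in a few lines. The example below is a toy stand-in, not the network itself: a small template plays the role of the learned detector, and sliding it over the image produces a map of match scores, peaking where the target object sits.

```python
import numpy as np

def sliding_window_scores(image, template):
    """Slide a small template (the network's 'field of view') across the
    image and score each position by how strongly the patch matches."""
    H, W = image.shape
    h, w = template.shape
    scores = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            patch = image[i:i + h, j:j + w]
            scores[i, j] = np.sum(patch * template)  # correlation score
    return scores

# Toy image with a bright 3x3 blob standing in for a "hand".
image = np.zeros((10, 10))
image[4:7, 5:8] = 1.0
template = np.ones((3, 3))

scores = sliding_window_scores(image, template)
best = np.unravel_index(np.argmax(scores), scores.shape)
print(best)  # top-left corner of the best match: (4, 5)
```

In a convolutional network this scan is exactly what a convolution layer computes, with the template replaced by learned filters, and the whole loop runs as one efficient operation repeated for every body part.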
MoDeep also detects motion by tracking the direction in which each pixel in an object is moving from one frame to another in a video. This feature draws inspiration from the human brain’s primary visual cortex, an area of the brain that detects both patterns and motion.
“There are neurons that will fire if something moves within their little field of view in a particular direction. In computer vision, this kind of feature is called optical flow or motion features,” LeCun says.
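Optical flow can be illustrated with a minimal block-matching sketch. This is an assumption-laden toy, not the flow method the researchers used: for a patch in one frame, it searches a small neighborhood in the next frame and reports the displacement with the best match, which is the patch's direction of motion.

```python
import numpy as np

def block_match_flow(prev, curr, y, x, size=3, search=2):
    """Estimate where the block at (y, x) in `prev` moved to in `curr`
    by testing small displacements and keeping the best match."""
    block = prev[y:y + size, x:x + size]
    best, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ny, nx = y + dy, x + dx
            cand = curr[ny:ny + size, nx:nx + size]
            err = np.sum((cand - block) ** 2)   # sum of squared differences
            if err < best_err:
                best, best_err = (dy, dx), err
    return best  # (dy, dx) displacement = direction of motion

# Frame 1 has a bright blob; in frame 2 it has shifted one pixel right.
prev = np.zeros((12, 12)); prev[5:8, 4:7] = 1.0
curr = np.zeros((12, 12)); curr[5:8, 5:8] = 1.0

print(block_match_flow(prev, curr, 5, 4))  # (0, 1): moved one pixel right
```

Computing such a displacement for every pixel yields a dense flow field, which is the kind of motion feature MoDeep feeds into the network alongside the raw image.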
The researchers tested MoDeep on thousands of frames from Hollywood movies and found the network is better at estimating body pose when it has image and motion data to work with, rather than images alone.
The paper, co-authored by Arjun Jain, Jonathan Tompson, Christoph Bregler and LeCun, was presented at the Asian Conference on Computer Vision, held Nov. 1 to 5, 2014. It is the third in a series on human pose estimation by the group; the first two explored recognizing human hands and learning body pose from still images.
LeCun says MoDeep could have a wide variety of applications such as game consoles that recognize gestures, avatars that mimic our movements on screen in real time or health care uses for checking gait, posture and range of motion.