Peter Plantec analyzes the cognitive dissonance that's at the heart of the Uncanny Valley problem in creating virtual humans.
Animation has its wall, much like the sound barrier -- difficult to get past, but theoretically possible. It happens when our 3D-animated humans start to look too real. You'd think the more real a character looked, the more believable it would be, but just the opposite is true -- to a point. Before you reach the threshold of believability, you have to travel through the "Uncanny Valley."
Robot designer Masahiro Mori coined the expression "Uncanny Valley" way back in 1970. He was referring to the seemingly inexplicable plummet in the credibility of robotic characters as they approach a certain high level of realism. In other words, the more real they look, the creepier they make you feel.
Interestingly, Sigmund Freud observed that his patients often described an uncomfortable feeling when something was both familiar and unfamiliar at the same time. He called it "the uncanny." It is close kin to cognitive dissonance: the tension created when a person holds two or more conflicting beliefs, emotions or feelings at the same time. For example, when a virtual human actor looks very, very real, you want to believe it is real, but a subconscious protective mechanism in your brain starts looking for flaws to prevent you from being fooled.
The result is simultaneous belief and disbelief. Movies are all about suspension of disbelief and this kind of dissonance disturbs that process -- ditto for games.
Think of it this way: most animated characters are not trying to fool you. They are what they are, but when they start to look too human, they're trying to fool you, and your subconscious is wary of being fooled. The result is a nagging feeling just below consciousness that prevents clean suspension of disbelief. At some point, when we get nearly everything right, we'll be able to believe again. For now, consider such iconic characters as Robby the Robot or Homer Simpson. With them there is no cognitive dissonance. We understand and accept them because there is no pretension.
On the other hand, Aki Ross in Final Fantasy inadvertently comes across as pretentious: a cartoon character masquerading as a human. As she moves, our minds pick up on the incorrectness. And as we focus on her eyes, mouth, skin and hair, they destroy the illusion of reality. Adding a voice we recognize (Ming-Na) only complicates matters. Thus, we identify Aki Ross in too many incompatible ways simultaneously and our brains can't handle it. But more about Final Fantasy later.
Crappy MoCap also kills the illusion every time, so care must be taken on the body, but most especially on the face's emotional zone. It's roughly a circle that goes from the bottom of the chin to midway up the forehead. Side to side, it's the entire width of the face. If we get that right, most of the audience will forgive minor errors in body language... but don't push it. Theoretically, at some point, we can create a character so real in looks and behavior that most people will accept it as real. That's the goal.
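To make that region concrete, here's a minimal sketch in Python of the emotional zone as an ellipse a review pass could test against. The landmark inputs are entirely hypothetical; nothing here comes from any studio pipeline (a real tool would get these coordinates from a face tracker).

```python
# Sketch of the "emotional zone": an ellipse spanning the full width of
# the face, from the bottom of the chin to midway up the forehead.
# All landmark values are hypothetical image coordinates (y grows down).

def emotional_zone(chin_y, brow_y, hairline_y, face_left_x, face_right_x):
    """Return (center, radii) of the zone as an axis-aligned ellipse.

    Vertical extent: bottom of chin up to midway up the forehead
    (taken here as halfway between brow line and hairline).
    Horizontal extent: the entire width of the face.
    """
    top_y = (brow_y + hairline_y) / 2.0
    cx = (face_left_x + face_right_x) / 2.0
    cy = (chin_y + top_y) / 2.0
    rx = (face_right_x - face_left_x) / 2.0
    ry = abs(top_y - chin_y) / 2.0
    return (cx, cy), (rx, ry)

def in_zone(x, y, center, radii):
    """True if point (x, y) lies inside the ellipse."""
    (cx, cy), (rx, ry) = center, radii
    return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0

center, radii = emotional_zone(200.0, 100.0, 60.0, 50.0, 150.0)
print(in_zone(100.0, 140.0, center, radii))  # the zone's center is inside
```

A QA tool built on something like this could weight capture error inside the zone far more heavily than error elsewhere on the body, which is exactly the priority the paragraph above argues for.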
It's very expensive and time-consuming to capture all the paralinguistic nuances that make up a performance. According to Dr. Parag Havaldar, lead software engineer on Beowulf, "[The Robert Zemeckis film] is an awesome accomplishment when you think about it. It's a 100% animated film with many digital bodies and faces exhibiting believable realism. Sure there are flaws when you compare it with real actors, but we've come a long way since Polar Express. We used advanced techniques to capture eye movement and detailed face movements using our ImageMotion technology." He added that Sony Pictures Imageworks had more than 14 terabytes of performance capture data to filter and tweak.
Still, the level of keyframe tweaking was not always what it could have been. For example, I noticed in one of Anthony Hopkins' performances as King Hrothgar that the actor exhibited many subtle mouth and facial expressions that were not picked up by the performance capture equipment. Several layers of nuance were added by hand, but something was still missing. We're talking very subtle: he had a nasty little smile that held so much meaning, and some of it is missing in the final print. That tiny bit of smile apparently carried a lot of emotional information, because I could feel the difference. As Havaldar explained, "It is not a procedural process and we're making good progress, but the amount of hand tweaking that would have been required to enunciate all subtle expressions on every character in a movie of this scale would have been expensive.
"For me this means that the crew was able to bring the digital performances to a place where suspension of disbelief is not only possible, but less painful than in earlier works. I understand the stereo (3-D) version is remarkably compelling; I saw a clip of it in France, and now I can't imagine watching it in plain old 2-D."
How Cognitive Dissonance Works
Let's say that the first virtual human actor was the Great Synthespian, Nestor Sextone, back in 1988. Created by Jeff Kleiser and company, Nestor is almost entirely hand-animated, built with interpenetrating joints; he didn't even have IK. As a result, the animation is a bit primitive by today's standards, but it was cutting edge back then. It's easy for us to accept Nestor as real because he is a cartoon character playing a synthetic human -- no attempt to fool us. Bursting with personality, he creates no cognitive dissonance. He is what he is. With Nestor we are in full-on belief mode; we're really comfortable with him and enjoy his short performance. If you haven't seen Sextone for President, you should. It's become a historical reference. The important thing here is that animated characters are a long-established reality for us. We don't see them as perpetrating any perceptual trickery.
If we jump ahead to 2001 and Final Fantasy: The Spirits Within, we have one of the first real attempts to create animated characters that look like actual people. They are, in fact, animated characters playing human roles. It was a breakthrough movie, and the team did a fantastic job for the day. Unfortunately, the animated behaviors, especially the facial expressions and eye movements, were off by a mile. Yet the characters looked real enough to trigger our innate bogosity detectors. My subconscious brain started screaming "Bogus people -- not real!" Even though I wanted to suspend disbelief, it was extremely difficult. Unfortunately, the subconscious portion of our minds is hopelessly primitive and unable to process logic, so we can't just shut it off, or even shut it up.
As a psychologist, I spent much of my time helping people cope with misguided, self-defeating unconscious behaviors. An example would be the very smart, pretty woman who grew up being told how pretty she was by her parents and how stupid she was by jealous "friends." As an adult, she believes she really is more beautiful than she is and has little faith in her intelligence. Both conditioned beliefs defy logic, but they can't be easily shut off: her subconscious suppresses any attempt at looking or being smart, and she's unaware of this. You see it all the time on the TV reality show Beauty and the Geek.
As applied to the Uncanny Valley problem, your subconscious knows -- big time -- how real humans behave. We spend our lives reading unspoken messages from others, even though we're not fully aware of it. It's how we try to find the truth. It's in this unconscious realm where most communication takes place and where our beliefs are housed. The characters in Final Fantasy communicated primarily through speech, while the rich world of subliminal expression was almost completely missing. Because they looked so real, we expected the full spectrum of face and body language... what we got was a dull, numb nothingness. That missing material is specifically what triggered our bogosity alarms, bringing on disbelief. I still liked the movie, but it was a lot more work than it should have been to watch...
Let's move up to 2004 and The Polar Express. With improved MoCap, performance capture and other technologies, director Robert Zemeckis was able to push the envelope beyond Final Fantasy. It was a tour de force that brought us halfway down the Uncanny Valley's big drop-off. For one thing, Zemeckis chose here, as he partially did in Beowulf, to have his animated characters look much like the real actors playing them... Tom Hanks, for example, as the conductor and six other characters. This exacerbated the situation: not only do we expect the characters to convince us they're human, but also that they are Tom Hanks. Zemeckis wisely chose the non-photoreal shading used throughout The Polar Express, which pulled everything back a bit from reality. But the look and voice characteristics still triggered disbelief. It was a wildly uncanny experience.
Despite advances in technology, the characters in Polar Express were, for many, harder to buy than the ones in Final Fantasy. Why? They pulled the trigger on our bogosity detectors faster and harder. They creeped many of us out because the unconscious communication in Polar was not missing: it was weirdly wrong. Those strange eye movements, much like those of an insane person, freaked me out. But then I'm a delicate person. Don't get me wrong: Zemeckis is a true pioneer in this field and we all owe him a debt for taking the upfront risks involved. We're all learning from his work. Beowulf clearly takes many steps in the right direction, but he's still expecting us to accept both the actors and the characters, which is problematic.
Moving up performance-wise, we come to the cutting-edge technology used in Pirates of the Caribbean: Dead Man's Chest and At World's End. Here there was plenty of money to focus on a relatively limited part of the film. I'm sure you read about the innovative "in vivo" approach to performance capture ILM used, where Bill Nighy wore spooky face makeup and tagged gray pajamas that were tracked during actual on-set performances with the other actors. ILM actually had to replace Bill's entire body as well as his face and head. Davy Jones' amazing octopus head was separately animated and seamlessly meshed with the rest of the character. It was a brilliant job. Several factors help us buy Davy Jones as a real human performance. First, there's no attempt to make us believe that the actor is Bill Nighy; that helps a lot. Second, we're not asked to buy that Davy is actually human. A lot of what we are trained to look for in facial behavior is obfuscated by the monster "makeup." This helps tremendously. Then there is the smart use of quirky, human-like behaviors that make you believe -- for instance, the little pop Davy does with his upper lip when he's thinking. Small human-like quirks added to a dramatic performance eased the task of suspending disbelief.
Besides the innovative, remote Imocap process at ILM, new rendering techniques were used to create ultra-realistic skin on Davy. The process imitates the way light penetrates our epidermis and diffuses through the subcutaneous layers, with their blood vessels lending color. All this advanced technology developed by brilliant minds would mean little without the artistry of the people who applied it. Davy is a work of living, moving art that we accept as human. Nighy's exaggerated performance also helped a bit, as we subconsciously accept Davy as an act rather than an animated performance.
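The light-diffusion idea can be sketched with a toy model. This is not ILM's renderer; it's a minimal Gaussian-falloff illustration, and the per-channel widths are made-up values chosen only to reflect the fact that red light travels farther through flesh than blue.

```python
import math

# Toy sketch of subsurface scattering: light entering the skin re-emerges
# nearby, blurred and tinted by the tissue it passed through.
# Hypothetical diffusion widths (mm) per color channel; red penetrates
# deepest, blue the least.
WIDTHS_MM = {"r": 1.2, "g": 0.6, "b": 0.3}

def diffusion_profile(distance_mm, width_mm):
    """Gaussian falloff of re-emitted light vs. distance from entry point."""
    return math.exp(-(distance_mm ** 2) / (2.0 * width_mm ** 2))

def scattered_color(distance_mm):
    """Relative RGB energy re-emerging distance_mm from where light entered."""
    return {c: diffusion_profile(distance_mm, w) for c, w in WIDTHS_MM.items()}

# Close to the entry point all channels survive; farther away the light
# shifts red, which is part of what gives rendered skin a soft, warm glow.
near = scattered_color(0.2)
far = scattered_color(1.0)
print(near)
print(far)
```

The design point is simply that skin cannot be shaded as an opaque surface: the color at each point depends on light that entered the surface somewhere else.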
There is a lot of discussion about whether Davy really got us across the Uncanny Valley because, though he was fully believable, he wasn't fully human. I'm personally torn on the issue. I can look at Davy Jones (as can most of the audience) and accept the character without suffering subconscious angst. Part of it is Nighy's wonderful in vivo acting performance, but much of it is due to the amazing attention to facial animation detail the ILM people put in. For example, if you look at Davy's eyes, which are a big communication channel, they're full of subliminal messages being expressed by Nighy's real performance. It was something to behold.
Pure technology is not yet capable of capturing anywhere near the level of performance needed to cross the Uncanny Valley. So ILM wisely used keyframe tweaks based on the actual video of Nighy's eyes. It was absolutely necessary and, for me, it worked superbly. I believe it was the marriage of superb on-set MoCap and brilliant hand animation that catapulted Davy into the realm of the believable. If they had done just one thing wrong, it would have blown the whole illusion.
Yet could ILM have pulled off the same level of "humanness" if Davy had looked like a human? I don't think so. One of my colleagues, a very knowledgeable vfx supervisor, suggested that ILM found a "sweet spot" in the Uncanny Valley. That's probably true, but it takes nothing away from the achievement.
Some Practical Considerations
Now, try visualizing the Uncanny Valley in 3-D: we have a number of continua forming the geometry. Below is a non-exhaustive list of Uncanny Valley parameters:
Look: Cartoonish -- Photoreal (Nestor Sextone; Aki Ross from Final Fantasy)
Morphology: Monster -- Human (Davy Jones; Angelina Jolie as Grendel's mother in Beowulf)
Behavior: Stylized -- Recognizable (Davy Jones; Angelina Jolie)
Face: Unfamiliar -- Familiar (Aki Ross; the Conductor [looks like Tom Hanks])
Voice: Character -- Recognizable (Lucille Bliss as Crusader Rabbit; Angelina Jolie)
Animation Style: Squash-and-Stretch -- Tweaked MoCap (Road Runner; Davy Jones)
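Purely as an illustration of how these continua interact, the list above can be sketched as a toy scoring model. The parameter names, weights and formula below are my own hypothetical construction, not anything from a production pipeline; the intuition they encode is the article's: dissonance spikes when a character is highly realistic on some axes and stylized on others.

```python
# Toy model of the Uncanny Valley continua. All names and formulas here
# are illustrative assumptions, not measured data.

PARAMETERS = ["look", "morphology", "behavior", "face", "voice", "animation_style"]

def dissonance(scores):
    """Estimate viewer dissonance for a character.

    scores maps each parameter to 0.0 (stylized/cartoonish end of the
    continuum) through 1.0 (photoreal/recognizable end). The rough
    intuition: dissonance grows with the spread between the most and
    least realistic parameters -- a photoreal face with stylized
    behavior is worse than a uniformly stylized character.
    """
    values = [scores[p] for p in PARAMETERS]
    spread = max(values) - min(values)   # internal inconsistency
    realism = sum(values) / len(values)  # overall push toward "real"
    return spread * realism              # both must be high to creep us out

# A Nestor Sextone-like character: uniformly cartoonish, no dissonance.
nestor = {p: 0.1 for p in PARAMETERS}
# An Aki Ross-like character: photoreal look, lagging behavior.
aki = {"look": 0.9, "morphology": 0.9, "behavior": 0.3,
       "face": 0.9, "voice": 0.8, "animation_style": 0.4}

print(dissonance(nestor))  # low
print(dissonance(aki))     # high
```

The point of the sketch is only that the parameters are coupled: moving one slider toward "real" raises the bar for all the others.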
In general, the farther to the left on each of these continua, the easier the character is to sell; the farther to the right, the more human-like, familiar and recognizable the character, and the more difficult the illusion is to sell. Another thing, of course, is that if you put squash-and-stretch on a photoreal virtual human, it doesn't work. However, at least one expert I spoke to said that if a little squash-and-stretch had been added to the animation in The Polar Express, the characters would have been more acceptable, considering the shading choices made. In Beowulf, a semi-photoreal Angelina Jolie with her actual voice is a hard sell.
I've only mentioned six parameters, and they impact each other in serious ways. For example, if we did everything right but had Lucille Bliss providing the voice for Jolie as Grendel's mother, it would be a disaster, so let's simplify and not go there. By backing off on the photoreal and making her slightly less real, she becomes easier to accept. In our minds Angelina becomes a real person playing a cartoon character -- much easier to buy than a cartoon character playing a real person. Then there's MoCap. It doesn't work well for cartoon characters, just as squash-and-stretch doesn't work for life-like photoreal characters.
The Uncanny Valley is filled with paradoxes, isn't it? For example, we accept Davy Jones as human because he has a fully rendered human personality. It is completely counterintuitive, but in moderation, body morphology moving away from human actually helps us to see the character as human. In Davy Jones, we see a human playing a monster, something we're familiar with. In fact, Davy is entirely synthetic, with only Bill Nighy's captured data stream providing a shadow of humanity. As you can see, the creature's morphology does not prevent the actor's humanity from shining through. That's the key: finding and finessing the humanity.
As an aside, the artists and engineers at Electronic Arts discovered some of that when they reverse-engineered real humans to make them appear virtual in a game cinematic. I was told that they had to remove such things as skin pores and arm hair, and that they gelled the real hair to make it seem less real. Habib Zargarpour, vfx pioneer and art director at EA, told me: "We had to remove all the stuff that virtual humans don't have right yet in order to get believable virtual humans -- played by real humans."
The Technology We Need
Artistry and psychology are the keys to creating believable characters at this stage of evolution. The best technology in the world will not help if we don't know what to capture and how to apply it. A great human performance is the basis for all virtual human work at present. Sure, technologies like Softimage's Face Robot are finding reasonably credible ways to simulate human performance from limited data, but full believability remains a ways off. Assuming a great acting performance like Nighy's, we need to know what to capture and what to do with it.
In a recent interview with Variety's Anne Thompson, vfx whiz Rob Legato intimated that the keys to crossing the Uncanny Valley are to avoid realism and to liberate the director from the computer. Legato should know: he has provided a pioneering director-centric system for James Cameron on Avatar, allowing him to incorporate a flexible, live-action-style methodology as part of the next-gen virtual filmmaking process.
The technologies most in need of advancement also involve subtle performance capture. Capturing eye expressions, including the muscle ridges around the eyes, is very important. That includes saccadic eye movement: the rapid, constant movement that redirects the eye so the fovea can take in different parts of the scene. Often neglected but critically important is the inside of the mouth. Tongue and teeth/jaw movement is almost as critical as the eyes, and then there's spittle. Then there's all the secondary movement, like fat and hair. Hair movement is more critical than you might expect. Believe it or not, you can get the whole thing right, have the hair move just a little bit wrong, and blow the entire illusion. Final Fantasy was a great example of distracting hair.
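For illustration, here's a tiny sketch of procedural saccade timing. The durations are rough, commonly cited ballpark figures for human fixations, not data from any capture system, and the function itself is a hypothetical helper, not part of any real tool.

```python
import random

# Illustrative saccade timing: the eye holds a fixation for a fraction
# of a second, then jumps almost instantly to a new target.

def generate_gaze(duration_s, seed=0):
    """Return a list of (start_time_s, gaze_x, gaze_y) fixations."""
    rng = random.Random(seed)
    t, fixations = 0.0, []
    x, y = 0.0, 0.0
    while t < duration_s:
        fixations.append((t, x, y))
        t += rng.uniform(0.15, 0.6)   # fixations last roughly 150-600 ms
        # Saccade: a near-instant jump to a new point of interest.
        x += rng.uniform(-0.3, 0.3)
        y += rng.uniform(-0.2, 0.2)
    return fixations

gaze = generate_gaze(3.0)
print(len(gaze), "fixations in 3 seconds")
```

Eyes keyframed to hold at fixation points like these, with the jump between them lasting only a frame or two, read as far more alive than eyes that drift smoothly from target to target.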
Render quality is also essential to believability. I have to mention Professor Paul Debevec's work studying the complex process of how light interacts with skin and flesh, which is helping immensely. Christophe Hery at ILM also contributed both his artistic and his technical skills in rendering Davy Jones' skin.
Creating a virtual human actor is a complicated process with no room for sloppy anything. It's a synergistic process among art, technology and psychology. We have to discover all the key human factors and then obsessively apply them and tweak them, until we have it right. Only then will the average audience member accept our illusions as real with comfort and enthusiasm. I'll continue my research in this arena and tell you more as I discover it. Let's hope we all get it right in the near future and move way beyond that nasty deep uncanny gully. The Holy Grail is a fully human looking, perhaps recognizable, virtual human, which we can all believe in without dissonance. I figure two more years with luck.
Writer's Note: I swapped ideas with a number of experts while preparing this article. Josh Kolden of Crack Creative and Christophe Hery at ILM were particularly helpful, along with Professor Paul Debevec and several people at studios who prefer to remain anonymous. There have also been dozens of people over the years who have helped me with this research. Thank you all.
Peter Plantec is a best-selling author, animator and virtual human designer. He wrote The Caligari trueSpace2 Bible, the first 3D animation book specifically written for artists. He lives in the high country near Aspen, Colorado. In addition to his work in vfx and journalism, Peter is also a clinical psychologist with more than a decade of clinical experience. He has spent several years researching the illusion of personality in animated characters. Peter's latest book, Virtual Humans, is a five-star selection at Amazon after many reviews.