Weta Digital’s facial modeling and performance capture animation expert discusses the difficulties and challenges of creating one of the most convincing digital humans ever seen on film.
Through the use of innovative CG animation, Brad Pitt was convincingly de-aged in Benjamin Button, Michael Douglas in Ant-Man, and Samuel L. Jackson in Captain Marvel. Now, in Ang Lee’s futuristic action-thriller, Gemini Man, it’s Will Smith’s turn; the 50-year-old actor plays assassin Henry Brogan and his 23-year-old clone, Jackson (aka Junior), himself an assassin sent to kill his older self.
The talented artists behind Smith’s convincing digital performance work at Weta Digital in New Zealand.
As head of the studio’s facial motion department, Stuart Adcock was responsible for building an authentic, convincing, photorealistic younger version of Smith. While Adcock and his team worked to de-age Smith back to his Bad Boys / Fresh Prince of Bel Air days, the current, 50-year-old Smith, under Lee’s direction, put on the head camera and performance/motion capture suit and acted out his younger self to deliver the much-needed data. Not only did the Weta team have to deliver a convincing performance, they also had to deal with Lee’s high frame rate / 120 frames per second filming, which presented an entirely new set of challenges in dealing with a digital human face.
Adcock studied Computer Animation at Bournemouth University in the UK; after 15 years working in games, he switched to his dream job, producing visual effects. Landing at ILM, he worked on facial performance capture systems for projects like Doctor Strange, Star Wars: Episode 8 and Ready Player One. He eventually ended up as the facial supervisor on Disney’s Aladdin, and is now considered one of the world’s leading experts in facial modelling and animation, performance capture, data acquisition, solving and retargeting. He sees himself as a creative problem solver who views tight constraints as opportunities for creativity; his process blends deep knowledge of Unreal Engine, real-time rendering techniques and VFX approaches with a clear understanding of how these can create visually unique experiences.
Germany-based journalist Johannes Wolters recently spoke with Adcock in Berlin about the film, discussing the many difficulties he encountered while developing one of the most convincing digital human performances we’ve seen in film so far.
Johannes Wolters: A very basic question first, to understand your work better. The work you do, is this Animation or is this a Visual Effect?
Stuart Adcock: I would say, we went into this film knowing that it would be the most challenging visual effects that we have done to date: a believable realistic human. But as the project went on, it felt like we were not just creating a visual effect, it felt more real than that, like we were creating a reality, a real person and having to understand every level of nuance, every aspect of a human from the ground up. Our process was not about de-aging or manipulating photography; we built a fully digital real human. And we used as much of current “Will Smith” as possible to help us do that. We scanned Will and gave ourselves a year to create a digital double of Will. Once we were happy with Will, we then went down the path of trying to augment the puppet into a younger version by doing line-ups to reference of Will during that era (mostly Bad Boys).
Creating two digital Will puppets - one 50 years old, the other 23 - was important for our process, especially when it came to bringing them to life. When describing our process, I often like to draw parallels to music, right? If we listen to a musical performance, and can understand what notes are being played, we can represent this as a language, i.e., sheet music. Then we can play that score back on a different instrument, and it is going to sound familiar but also unique. It’s the same performance, it’s the same piece of music, but it sounds slightly different because you are playing it on a different instrument.
So, for us, the animation challenge is similar; our puppets are our instruments. First, we use performance capture and a facial solver to understand what Will is giving us in terms of raw performance, which we turn into a language of muscle shape activations (similar to music notes). Our facial animators craft this further and ensure we’re able to play a performance back on our digital double and it matches what Will gave us perfectly. When we’re happy, we then swap the instrument, now playing it on young “Junior.” We look at it again and see, is that working? Is it still faithful to his performance, but looking more youthful? Let’s look at some more young reference of Will, just to make sure that the corners of his lips, and his slightly more pillowed lips with more volume, are faithful to how they were behaving.
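Adcock’s “sheet music” analogy can be sketched, very loosely, in code. The following is a toy illustration only, not Weta’s pipeline: the function names, the two-shape basis, and the 1-D geometry are all invented for the example. A captured pose is solved into shape activations (the “notes”) by least squares, and the same activations are then replayed on a second puppet’s shape basis (a different “instrument”).

```python
import numpy as np

# Hypothetical, highly simplified sketch of solve-then-retarget.
# A blend-shape puppet is: pose = neutral + shapes.T @ weights.

def solve_activations(neutral, shapes, captured):
    """Least-squares fit of captured pose into shape activations."""
    delta = captured - neutral
    weights, *_ = np.linalg.lstsq(shapes.T, delta, rcond=None)
    return np.clip(weights, 0.0, 1.0)  # activations kept in [0, 1]

def apply_activations(neutral, shapes, weights):
    """Replay the same activations on any puppet's shape basis."""
    return neutral + shapes.T @ weights

# Toy example: a 4-point "face" with two shapes (say, a lip corner
# puller and a brow raiser), each stored as a delta from neutral.
old_neutral = np.zeros(4)
old_shapes = np.array([[1.0, 0.5, 0.0, 0.0],
                       [0.0, 0.0, 0.8, 0.3]])
captured = old_neutral + 0.6 * old_shapes[0] + 0.2 * old_shapes[1]

w = solve_activations(old_neutral, old_shapes, captured)

# The younger puppet shares the shape vocabulary but has different
# geometry, so the same "notes" produce a familiar-but-unique pose.
young_neutral = np.full(4, 0.1)
young_shapes = old_shapes * 1.2
young_pose = apply_activations(young_neutral, young_shapes, w)
```

The transferable part is the weight vector; each puppet’s own shape basis determines how those weights look when played back, which is exactly the same-score, different-instrument idea.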
We used a lot of visual reference, but also scientific research to validate our observations. We partnered with universities to try to understand more about how faces age over time. Since the face hinges off the skull, we started there. One thing we learned was that the skull is constantly regenerating: every 10 years you pretty much have a new skull. Between Will playing at 50 and “Junior” at 23, his skeleton had pretty much regenerated three times. You can understand what that means regarding your face – some areas of bones are growing further than others, other parts are deteriorating as you get older. The jawbone generally starts to eat away at itself and people lose their teeth and so forth. Will is in great shape, but just understanding the principles of how faces age meant that we could more accurately hit on a 23-year-old version of Will. Often in science it helps to overshoot the mark and gather various samples to better predict a point in time. So, we did exactly that; we referenced a picture of Will when he was eight years old, and re-modelled to that, just so we could understand the transition between current age Will and when he was a young kid. It helped us to understand where we needed to be at 23.
JW: On a very amateurish level, all people are somehow experts in reading a face. Everybody can tell what someone is thinking from their face, but we amateurs could never say exactly why we know that. You, on the other hand, have to be an expert who knows why we as amateurs know how to read a face.
SA: I could not agree more. Pretty much everything we create at Weta, we first look at what it is we are trying to recreate in as much detail as possible. Because when we can understand why certain characteristics are happening, we can then think about how we create that digitally. And with the face, it’s no different. But like you say, we are all experts at reading faces and emotions and we really try to understand what this means. Why are eyes so difficult, and what are we seeing in eyes that may be the reason we haven’t been so successful recreating them in the past?
One example is that we discovered your tears are actually a mix of fatty oils and water. That mixture can change depending upon external temperatures in the room, but it can also depend on your diet and state of mind; it’s a tell-tale sign of whether someone has tired eyes, or whether they are actually a bit upset. Oil focusses light, so while the eyes get reflective, their base actually receives less broad light, so they subtly darken. And that is just the difference related to the mixture of oil and water in your eyes. Beginning to learn things like that means we can apply it on shots.
On this project, it was not only about the challenge of creating an authentic young Will Smith, it was about creating some of those memories that we all have. You know, when you think about Will Smith in that period of time, you know him from his TV shows and films. We needed to evoke those feelings. This meant that as much technology as we had, as much scientific research as we had, at the end of the day, we needed to step back, look at every shot and feel like: do we believe this is Will? Are we authentic to his work, what he gave us on that day and in terms of his performance? So yes, it was a really big challenge!
JW: I agree, though I’m still not quite clear on whether the main digital character is a visual effect or animation.
SA: For us at Weta, this was a huge animation task. Let’s not beat around the bush. Every shot was animated on top of a performance capture. Right? The performance capture of Will would allow us to determine what musical notes were being played. But ultimately it lands on an animator. Every shot had to go through an animator who was painstakingly chasing subtle details that weren’t captured in the performance. There were levels of detail in the tongue, in the neck, in the subtle stickiness and softness of the lips. Even the eyes; there are so many subtle characteristics with the epicanthic fold, and how your eyes actually have a sticky tendency very similar to your lips - which subtly changes the eyelid line around the epicanthic fold. All these little cues were needed, and we could not get that level of detail from performance capture.
We had to chase those details in animation, so this was definitely an animation project! It just meant that the reference we were using was from Will himself. Looking back at the project, we were 100% faithful to Will’s performance. But, the route to understanding and achieving that on a [digital] character involved an enormous number of talented artists and animators.
JW: Why do you think films are moving towards photorealistic animation in such a big way at the moment?
SA: Well, first of all, we’re living in an exciting time where we are able to tell fictional stories in such a believably realistic manner. So, it may be a little bit of a flavor of the decade in some sense. But I think it is quite exciting now that we have another avenue to allow actors to play roles that they had not really considered in the past. Some of that magic can be playing a creature like “Gollum,” a dragon or other fantastical character. Now, it also means an actor can play a different period of their life, when they were younger or older. That’s pretty exciting. Will totally embraced that!
And for anyone to say we don’t need the actor anymore, you just need to look at Will and how much effort and work he put into this performance. First of all, he had to play “Henry,” then “Junior” in a motion capture suit with a head camera on. With the amount of effort and dedication he brought to the role, if anything, we’ll need more of an actor’s work in the future. It’s the actors giving us these authentic performances, which is what they do so well.
JW: You describe yourself as a “creative problem solver” who views tight constraints as opportunities for creativity. How did you deal with Ang Lee’s high frame rate of 120 frames, rather than 24 frames per second?
SA: Well, you’ve now touched on one of the biggest challenges on this film. Let’s not forget this was five times the number of frames. That means for every shot that falls onto an animator, it’s like they are playing it back in slow motion. They are looking at details they had never imagined, that we have never seen before. Normally those details would have been hidden behind motion blur, or simply left for the audience’s eyes to fill in the gaps.
JW: Was that a blessing or a curse?
SA: [Laughing out loud] We learned a lot more about the face by studying it in that kind of resolution, but it definitely put a lot of pressure on our animation team. By the end, everybody had gotten more used to it. But the total amount of animation we rendered came to nearly four hours of footage, which is a scary amount of time. And that meant we were chasing a level of nuance in the skin that we were unable to achieve with conventional methods. Perhaps you are aware of things like blend shapes; it’s kind of an industry-standard way of puppeteering a face. Now, the good thing about blend shapes is they give you full control over a shape. But the frustrating thing about blend shapes is that, in essence, they are linear. So, when you start to combine lip corner pullers with cheek raisers and upper lip raisers, you start to layer these shapes in a very linear way.
But skin is actually incredibly non-linear. Skin has inertia; skin has a memory. If I raise my brows and hold them for a moment, then drop them for a few milliseconds, the memory of the skin folds will stay true. So the appearance of skin is not this kind of layered, linear stack of shapes. If you trace a point on the upper lid during a blink, especially at 120 frames per second, it actually does more of an arcing, circular motion. As the palpebral muscle relaxes and then tenses again to pull the lid up, there are two muscles that overlap and behave together. And that means you get more of a flushing motion. With a simple shape called “blink,” halfway on the way into a blink looks exactly the same as halfway on the way out, and that is not realistic.
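The limitation Adcock describes falls straight out of the math. A quick toy sketch (invented names and a 3-point “eyelid”, nothing from any production rig) shows why a purely linear blend-shape rig cannot tell the way into a blink from the way out:

```python
import numpy as np

# A blend-shape pose is just neutral + a weighted sum of shape deltas.
# It is purely linear: the order and history of activation don't matter.
def blend(neutral, deltas, weights):
    return neutral + sum(w * d for w, d in zip(weights, deltas))

neutral = np.zeros(3)                 # toy 3-point eyelid
blink = np.array([0.0, -1.0, -0.2])   # "blink" shape delta

# Halfway into the blink and halfway out both use weight 0.5,
# so a linear rig produces the exact same pose at both moments,
# with no arcing motion and no memory of where the lid has been.
pose_in = blend(neutral, [blink], [0.5])
pose_out = blend(neutral, [blink], [0.5])
```

Any history-dependent behavior, like the arcing lid path or the skin’s “memory”, has to come from somewhere outside this linear sum.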
So, we developed a new technology called “Deep Shapes.” Deep Shapes gave us control over the energy of a shape in terms of the layers of skin, so that we were able to dial in a shape called, say, “lip corner puller,” but were also able to transition the energy of the shape from the deep fascia layer to the top level epidermis layer. Now we had this amazing extra level of control and dimension to play with. We largely automated the effect. So what this meant is that our animators could work quite quickly in terms of dialing in the key shapes for these beats, but then we would run a Deep Shapes Pass onto it and it would give us all of these lovely non-linear transitions from shape A to shape B.
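As a speculative sketch of the *idea* behind Deep Shapes as described here, not Weta’s actual implementation: one could store a shape’s deltas per skin layer and expose a “depth” dial that shifts the shape’s energy between the deep fascia and the epidermis. All names and numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical layered shape: the same activation can express itself
# deep in the fascia or up at the skin surface, which is one way to
# get non-linear transitions out of an otherwise linear shape rig.
def deep_shape(layers, weight, depth):
    """Blend one shape's per-layer deltas.

    layers: dict of layer name -> delta vector
    weight: overall shape activation, 0..1
    depth:  0.0 = all energy in the fascia, 1.0 = all in the epidermis
    """
    return weight * ((1.0 - depth) * layers["fascia"]
                     + depth * layers["epidermis"])

# Toy "lip corner puller" with different deltas per layer.
lip_corner_puller = {
    "fascia":    np.array([0.2, 0.1, 0.0]),
    "epidermis": np.array([0.0, 0.3, 0.5]),
}

# Same activation, different energy depth -> different surface result.
deep_pose = deep_shape(lip_corner_puller, 1.0, depth=0.0)
surface_pose = deep_shape(lip_corner_puller, 1.0, depth=1.0)
```

Animating `depth` over time alongside `weight` would give exactly the kind of shape-A-to-shape-B transition the article describes, without the animator having to hand-key every layer.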
The things that scare you in animation, quite frankly, are details like these. You look at the side of someone’s cheek as they are talking, and you see all those flutters, beats and little movements. And it’s like, you can’t attribute that to the jawbone. There is no one-to-one correlation there, but there is a lot going on with the skin. In many ways, Deep Shapes allowed us to capture that extra level of nuance in the skin, and at 120 frames per second, really helped us to create high-fidelity animation.
JW: How did you communicate with Ang Lee, who is known to be very much into visual effects? It feels like communication was especially key on a project like this one.
SA: It was. Practically speaking, the best way to communicate animation to a director for sign off is to not have to render everything, right? That would really hurt us in terms of our time capacity. So, we tried as much as possible to show Ang real-time renders of the facial performances to get his initial feedback. At Weta we use Manuka as our final renderer, but Gazebo is our real-time renderer for animators. We were running out Gazebo renders for Ang to take a look at.
Then it was about helping Ang to understand what the difference was between the two, which he was very good at. So, when we’d look at Gazebo, he knew he wasn’t looking at final resolution. There were a lot of question marks on things like, “That doesn’t feel quite photoreal. What am I looking at? How am I supposed to give approval on an animation when I can’t really feel whether it’s believable or not?”
So, we tried to communicate with him the difference between Gazebo renderer and Manuka renderer. Early on, we showed him a few fully rendered shots, so at least he could understand the range he was dealing with… “Okay I get it. So when I look at a final render, I see these kinds of details, and when I look at Gazebo, I’m looking at the bigger broader forms of the shapes!”
For the most part, we were able to sign off and approve animations using the real-time rendering. Ang would say, “I think I feel pretty good about this one. Let’s have a look at a render next time.” A few shots went back to animation after he looked at a render, because, as you know, only when you see it finally rendered can you notice some small imperfections that need fixing. So, we met with Ang three times a week, showing him shots and iterations. He was very much involved all the way through.
Ang is known for getting the best out of actors, and I think the same is true for visual effects. He was really nice to work with. He allowed us to collaborate with him. When he came to us, he always talked about having a “feeling.” Like he had a feeling for when he captured Will in this particular moment. One day I remember he said, “Look, I need you to understand that this is a ruthless assassin, but at the same time, I need you to feel like you want to sit down with him and enjoy a nice warm bowl of chicken soup!” And we were like, “Ahh… Okay?” We just agreed with him; after the call we were like, “Chicken soup??? What does that mean???” We wrestled with that for a bit; it became kind of an inside joke that we had to find the recipe for this chicken soup! And in the end, it came down to the eyes! It was the subtlety in the eyes! How the epicanthic fold fell. Will has incredibly soft eyes. There is a soft nature to his eyes. So even as he is playing this ruthless assassin, there is still this chicken soup effect!