NC AI's 'Facial Animation' Technology That Cuts Post-Processing Costs


gamemeca | Kim Hyung Jong, Reporter | 2026.06.17 15:34 UTC+9

AI Summary

NC AI's VARCO Face Sync uses diffusion models to fix lip-sync problems in games. By capturing its own data and refining how the model is trained, the team has removed the need for manual post-processing. Future work points to AI-driven rigging and combined face and body animation for convincing digital humans.

▲ Han-yong Jang, Director of NC AI's Physical AI Lab (Image source: screen capture from the official NDC YouTube video)

Let’s be honest, nothing pulls you out of a game faster than a character whose lip movements have absolutely nothing to do with what they’re actually saying. Matching the two is hard enough that some developers give up on lip-syncing entirely, and the mismatch becomes even more glaring during localisation. In the age of generative AI, however, we’ve seen a flurry of technologies promising to fix our facial animation woes. A recent talk at the Nexon Developers Conference (NDC) dived deep into how this is actually being put to work.

On Wednesday the 17th, at NDC 26, Han-yong Jang, Director of NC AI’s Physical AI Lab, took the stage to discuss 'From Speech to Expression: Streamlining Facial Animation with Generative AI.' Jang argued that as the baseline for in-game graphics and body animation continues to rise, the quality of facial animation has become the next critical frontier for developers.

▲ Conventional facial animation generation techniques (Image source: screen capture from the official NDC YouTube video)

Historically, we’ve relied on motion capture, but facial capture quality lags behind body capture, forcing artists to step in and fix things by hand. NVIDIA’s AI technology was presented as an alternative, but its base training data lacked the specific quirks of game audio, such as intense acting or heavy reverberation, and the result was distracting lip tremors. Epic Games had a go at it too; being an engine developer, they managed to control the jitters, but the lip movement came out too smooth, resulting in mushy, imprecise articulation. Both approaches, ultimately, required extra post-processing time and money before they were fit for a commercial release.

Jang pinpointed the root of these failures: flawed source data and the nature of AI models themselves. When a character pronounces bilabial sounds (like 'm', 'b', or 'p'), the necessary lip-closing action isn't properly captured in standard training datasets like SMPL. Furthermore, because the same sound can be spoken with varying mouth shapes, AI models tend to 'average' the results, churning out a muddy, mumbling mess of an animation.
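The averaging effect is easy to reproduce in a toy setting. The sketch below is not NC AI's code; it assumes a model that predicts a single value under a squared-error loss, and the lip-aperture numbers are invented for illustration. When the same sound is sometimes spoken with closed lips and sometimes with open lips, the lowest-error single prediction is the mean, a half-open mouth that matches none of the real takes.

```python
import random
from statistics import mean


def best_constant_prediction(targets):
    """Return the single value that minimises mean squared error: the mean."""
    return mean(targets)


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return mean((prediction - t) ** 2 for t in targets)


# Hypothetical lip-aperture values (0 = closed, 1 = wide open) for the
# same sound recorded across many takes. These numbers are made up.
random.seed(0)
closed_takes = [random.uniform(0.0, 0.1) for _ in range(50)]
open_takes = [random.uniform(0.8, 1.0) for _ in range(50)]
takes = closed_takes + open_takes

averaged = best_constant_prediction(takes)
closest_real_take = min(abs(averaged - t) for t in takes)

print(f"averaged aperture: {averaged:.2f}")
print(f"distance to the nearest real take: {closest_real_take:.2f}")
```

The averaged value beats any single real mouth shape on squared error, which is exactly why such a model drifts towards a mumble. Generative approaches that sample one plausible shape rather than predicting the mean are one way around this.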

▲ One year of change in generative AI animation technology (Image source: screen capture from the official NDC YouTube video)

To break this cycle, the team at NC AI developed 'VARCO Face Sync,' which uses diffusion training and a transformer model to generate animation directly from audio. It’s a handy tool: side-quest dialogue or NPC lines that are impractical to record with voice actors can be converted to speech with TTS, and the animation is then generated automatically and can be exported directly as Unreal Engine assets. To tackle the 'dirty' data problem, they even built their own facial capture rig to record bilabial movements and expanded the training set, giving the AI something accurate to learn from.
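To make the 'dirty' data problem concrete, here is a minimal sketch of a check one might run on captured data. It is an assumption for illustration, not a description of NC AI's tooling: the function name, the normalised lip-gap signal, and the 0.05 threshold are all hypothetical. It measures how often the lips actually close during bilabial segments.

```python
BILABIALS = {"m", "b", "p"}


def bilabial_closure_rate(phonemes, lip_gap, closed_below=0.05):
    """Fraction of bilabial segments in which the lips close at least once.

    phonemes: list of (label, start_frame, end_frame), end exclusive.
    lip_gap: per-frame lip gap, normalised so that 0 means fully closed.
    closed_below: hypothetical threshold under which lips count as closed.
    Returns None when the clip contains no bilabial segments.
    """
    segments = [(s, e) for label, s, e in phonemes if label in BILABIALS]
    if not segments:
        return None
    closed = sum(
        1 for s, e in segments
        if min(lip_gap[s:e], default=1.0) < closed_below
    )
    return closed / len(segments)


# Made-up clip: the lips close on "m" but never fully close on "p".
phonemes = [("m", 0, 3), ("a", 3, 6), ("p", 6, 9)]
lip_gap = [0.30, 0.02, 0.20, 0.60, 0.70, 0.60, 0.25, 0.20, 0.30]

rate = bilabial_closure_rate(phonemes, lip_gap)
print(rate)
```

A low rate on a dataset would suggest the capture is missing the closures that bilabials require, which is the defect Jang described in standard training data.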

VARCO Face Sync also employs a retrieval-based voice conversion technique, which maps many different speakers onto a single reference dataset. Whatever audio comes in is replaced with a clean voice the model has already learned, which effectively filters out noise. To avoid the uncanny valley of a character having, say, a male upper lip and a female lower lip, they used identity embedding to keep speaker characteristics clearly separated. They also collected data only from segments where emotions were clearly defined, which improved the accuracy of emotional expression while acknowledging that it’s simply too much to expect a human actor to maintain high-intensity performances for hours on end.
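The idea behind retrieval-based conversion can be sketched as a nearest-neighbour lookup. This is a simplified illustration under stated assumptions, not NC AI's implementation: the function names are hypothetical, and the two-dimensional tuples stand in for the learned speech features a real system would use. Each input frame is swapped for the closest frame in a clean reference set, so unfamiliar or noisy voices end up expressed in features the model already knows.

```python
import math


def nearest_reference_frame(frame, reference_frames):
    """Return the reference frame closest to `frame` in Euclidean distance."""
    return min(reference_frames, key=lambda ref: math.dist(frame, ref))


def convert_to_reference_voice(input_frames, reference_frames):
    """Replace every input frame with its nearest clean reference frame."""
    return [nearest_reference_frame(f, reference_frames) for f in input_frames]


# Toy features: three clean reference frames and three noisy input frames.
reference = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
noisy_input = [(0.1, -0.05), (0.9, 1.2), (0.2, 0.8)]

converted = convert_to_reference_voice(noisy_input, reference)
print(converted)
```

Because every output frame comes from the reference set, noise in the input cannot reach the animation model directly; the trade-off is that the reference set has to cover the sounds the model will encounter.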

The end result is a system that reduces variance in quality and boosts stability to the point where its output went straight into commercial game cutscenes, with no separate post-processing by QA or artists. When a designer enters a scenario script into the system, the animation is generated immediately, drastically shortening the pipeline. As Jang put it, the goal isn't just quality but stability: 'Ensuring consistent stability is paramount, and above all, we need to avoid burning extra time and money on manual post-processing.'

▲ A comparison of different facial animations (Image source: screen capture from the official NDC YouTube video)

Jang isn't resting on his laurels, though. He outlined four areas for future improvement. First is using video generation AI to build training data. Why rely on limited motion capture when you can generate hundreds of hours of data and cherry-pick the best bits? Second, he pointed to AI-driven automated rigging. Manually setting up rigs for complex facial expressions is a bottleneck that keeps character expressiveness within a very rigid box; AI needs to handle this heavy lifting.

Third, we need better emotional nuance. Simple labels like 'happy' or 'sad' just won't cut it for film-quality performances. By using LLMs to label video data with complex directorial notes, we can coax more diverse, natural expressions out of the models. Finally, he envisions the integration of face and body animation. Once your lip-sync looks natural, static eyes and awkward body gestures become glaringly obvious. The ultimate goal, he says, is digital twin and digital human technology where facial expressions and body language are generated together in harmony.



This news was translated by AI.
