How Babble's face tracking works
Previously, I wrote an article describing how our eye tracking works. While we have given full-fledged presentations, both in person and virtually, on how our face tracking works, it's only just dawned on me that we have never written a blog post about it. Heck!
At a high level, our face model is a heavily modified EfficientNetV2. Unlike our eye model, which requires per-user calibration, our face model is a strong generalist that works out of the box.
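For the curious, here's a minimal sketch of that general shape in PyTorch. To be clear, this is illustrative, not our actual network: the blendshape count, input resolution, and head design below are assumptions, and the real model is heavily modified well beyond swapping the classifier.

```python
# A minimal sketch of the model's general shape: an EfficientNetV2
# backbone with its ImageNet classifier swapped for a blendshape
# regression head. Illustrative only; the blendshape count, input
# resolution, and head design are assumptions, not our real network.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

NUM_BLENDSHAPES = 45  # hypothetical count

class FaceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = efficientnet_v2_s(weights=None)
        in_features = self.backbone.classifier[1].in_features
        self.backbone.classifier[1] = nn.Linear(in_features, NUM_BLENDSHAPES)

    def forward(self, x):
        # Sigmoid clamps every output to [0, 1], the blendshape range.
        return torch.sigmoid(self.backbone(x))

model = FaceModel()
frame = torch.randn(1, 3, 256, 256)  # one lower-face camera frame
shapes = model(frame)                # (1, NUM_BLENDSHAPES), values in [0, 1]
```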
Motivations
Before Babble, you had a handful of face tracking solutions for use in Social VR:
Vive
I'm no stranger to the Vive Facial Tracker:
Vive Facial Tracker
Released in March 2021, the VFT is widely considered the first accessory of its kind, and most of its expressions track reasonably well without any calibration. It does, however, have a handful of shortcomings:
- Picky camera placement. It was designed with Vive products in mind.
- Poor tongue shape tracking (tongue twist, flat, skinny).
- Black box without any feedback to the user.
Vive also had a number of other products:
Vive Focus 3 Facial Tracker
Vive "Squidward" Full Face Tracker
All of the above accessories leverage the SRanipal runtime and track SRanipal shapes.
However, all of these existed in an even more locked-down ecosystem. At one point, Vive updated their software to work only if the accessory was plugged into a Vive headset.
Meta
Quest Pro
Tracks FACS expressions.
Pico
Pico 4 Pro/Enterprise
Tracks ARKit expressions.
We wanted to surpass all of these in cost and quality, challenge the Quest Pro, Pico, and other headsets in fidelity, and, just as importantly, emphasize user and data privacy.
History
In order to train an AI model, you need a lot of data. This shouldn't come as a surprise to anyone in 2026 who hasn't been living under a rock. Where do you get data? The internet, of course!
However, to train a face-tracking model you need a lot of face data. This was the first roadblock we had to surmount. Some face datasets do exist:
However, none of these contain the information we needed to train a face tracking model. What we really need are blendshapes: normalized 0-1 values, each representing the intensity of a facial expression.
We needed an image dataset with:
- Lower faces (the above datasets contained the full face)
- Close-up, high-FOV images (4 inches from the lower face at 160 degrees)
- Large variety of faces represented (skin tone, lighting, face shape, facial hair, etc.)
And a lot of images. Millions of them, with blendshapes to match!
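For concreteness, here's roughly what a blendshape label for a single frame looks like. The shape names below follow the ARKit naming convention purely for illustration; our actual shape set differs.

```python
# Illustrative only: a blendshape label for one frame. Each key names an
# expression; each value is its normalized 0-1 activation. Names follow
# the ARKit convention here; Babble's actual shape set differs.
import numpy as np

label = {
    "jawOpen": 0.62,
    "mouthSmileLeft": 0.10,
    "mouthSmileRight": 0.12,
    "cheekPuff": 0.35,
    "tongueOut": 0.00,
}

# Training then reduces to regression: predict this vector from an image.
target = np.array(list(label.values()), dtype=np.float32)
assert ((target >= 0.0) & (target <= 1.0)).all()
```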
Datasets
The Original Dataset
The original dataset utilized SummerSigh's iPhone and a Vive Facial Tracker.
This had a handful of issues:
- Limited data generation
- Images not representative of VR headset camera placement
End result? Mediocre.

Synthetic Dataset v1
Synthetic Dataset v1 took a more intelligent approach: we used free 3D assets to generate diverse facial images. This produced the first general models that could run on other users' computers. Below is an early render:
Over the course of 2 years we improved:
- Face shapes
- Skin textures
- Camera positions
- Lighting
- Number of face meshes used
We used this approach until Babble App v2.0.5.
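We haven't published the render pipeline itself, but the core idea is standard domain randomization: every frame independently samples the parameters listed above. Here's a hypothetical sketch; the asset names and ranges are invented, not our real values.

```python
# A hypothetical sketch of the per-frame randomization idea: each render
# independently samples a face mesh, skin texture, camera pose, and
# lighting, so the dataset covers the whole parameter space instead of a
# single studio setup. Asset names and ranges here are invented.
import random

FACE_MESHES = [f"face_{i:03d}" for i in range(50)]
SKIN_TEXTURES = [f"skin_{i:03d}" for i in range(200)]

def sample_scene() -> dict:
    return {
        "mesh": random.choice(FACE_MESHES),
        "texture": random.choice(SKIN_TEXTURES),
        # Jitter around the nominal mount point (~4" away, wide FOV).
        "cam_offset_m": [random.uniform(-0.02, 0.02) for _ in range(3)],
        "cam_fov_deg": random.uniform(140, 170),
        "light_intensity": random.uniform(0.2, 2.0),
        "light_angle_deg": random.uniform(0.0, 360.0),
    }

for _ in range(3):
    print(sample_scene())  # each dict parameterizes one render job
```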
Up until this point we have only talked about synthetic data, but real data is incredibly useful: it dramatically increases tracking quality and is great for edge cases. Facial hair? Weird camera images? Covered.
At this point, we began accepting user data submissions to improve our models! This came with its own challenges:
- Still requires good synthetic data
- Prone to user error
  - Poor timing, incorrect expressions, labor intensive
- Too many identical examples
  - Model overfits (one mitigation is sketched below)
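A common mitigation for the duplicate problem (a sketch for illustration, not necessarily our exact pipeline) is to drop any submitted frame whose perceptual hash is too close to one already kept:

```python
# A sketch of near-duplicate filtering (not necessarily Babble's exact
# pipeline): drop any submitted frame whose perceptual hash is within
# max_distance of one we've already kept.
from PIL import Image
import imagehash  # pip install imagehash

def deduplicate(paths, max_distance=4):
    kept, seen = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # Hash subtraction gives the Hamming distance between frames.
        if all(h - prev > max_distance for prev in seen):
            seen.append(h)
            kept.append(path)
    return kept
```

Raising `max_distance` filters more aggressively, at the risk of discarding genuinely distinct expressions.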
Synthetic Dataset v2
Real data submissions exposed shortcomings in our synthetic data. The newer synthetic data is significantly more diverse in facial hair, face shape, skin tone, lighting, and camera position, and we had more examples of edge cases to play with. Here is an example of a newer render:
Importantly, these frames are much faster to render; when you're generating millions of them, that speedup lets you iterate far more quickly. All of this culminated in Babble v2.1.0RC4, our current production face tracking model.