The animation of user avatars plays a crucial role in conveying their pose, gestures, and relative distances to virtual objects
or other users. Consumer-grade VR devices typically include three trackers: the Head-Mounted Display (HMD) and
two handheld VR controllers. Since the problem of reconstructing the user pose from such sparse data is ill-defined,
especially for the lower body, the approach adopted by most VR games consists of assuming the body orientation matches
that of the HMD, and applying animation blending and time-warping from a reduced set of animations. Unfortunately, this
approach produces noticeable mismatches between user and avatar movements. In this work we present a new approach to
animate user avatars for current mainstream VR devices. First, we use a neural network to estimate the user’s
body orientation based on the tracking information from the HMD and the hand controllers. Then we use this orientation
together with the velocity and rotation of the HMD to build a feature vector that feeds a Motion Matching algorithm. We built a
MoCap database with animations of VR users wearing an HMD and used it to test our approach on both self-avatars and other
users’ avatars. Our results show that our system can provide a large variety of lower body animations while correctly matching
the user orientation, which in turn allows us to represent not only forward movements but also stepping in any direction.
Method
We propose a new method to animate self-avatars using only one HMD and two handheld controllers.
Our system can be divided into three parts:
Body orientation prediction
Motion Matching
Final pose adjustments
Body Orientation Prediction
Predicting the body orientation is a common problem in applications using full-body avatars
with only one HMD and two controllers. Current methods use the HMD's forward direction to orient the whole body,
producing mismatches with the actual body orientation.
Instead, we trained a lightweight feedforward neural network to predict the body orientation from the rotation,
velocity and angular velocity of all three devices.
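As a rough illustration of what such a network could look like, below is a minimal sketch in PyTorch. The layer sizes, the input encoding (per-device quaternion, velocity, and angular velocity, plus the previously predicted orientation), and the class name `BodyOrientationNet` are assumptions made for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BodyOrientationNet(nn.Module):
    """Feedforward network mapping tracker signals to a horizontal body direction.

    Assumed input encoding (32D): for each of the 3 devices, a 4D rotation
    quaternion, 3D velocity and 3D angular velocity (30 values), plus the
    2D body orientation predicted at the previous step.
    Output: a 2D unit vector for the body's forward direction on the ground plane.
    """

    def __init__(self, in_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.net(x)
        # Normalize so the output is a valid direction vector.
        return d / (d.norm(dim=-1, keepdim=True) + 1e-8)
```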
Training the network to predict the orientation directly from ground-truth inputs would not match its real
usage scenario, in which each prediction must be made from the previously predicted orientation; the network
would therefore never learn this behaviour.
Instead, for every element in a training batch, we iteratively predict the orientation
\( r \) times (e.g., \( r=50 \)), feeding each prediction back as input to the next step. Then, we compute the MSE loss by comparing the final predicted
body orientation \( \mathbf{\hat{d}} \) with the ground-truth orientation \( \mathbf{d^*} \) after \( r \) frames.
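A minimal sketch of this rollout-style training step is shown below, assuming a network like the one sketched above and a feature layout in which the previously predicted orientation is concatenated to the per-frame device features. The tensor shapes and helper names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rollout_training_step(model, device_feats, d_init, d_star, optimizer, r=50):
    """One training step with r iterative orientation predictions.

    device_feats : (batch, r, 30) rotation/velocity/angular-velocity features per frame
    d_init       : (batch, 2) body orientation at the start of the window
    d_star       : (batch, 2) ground-truth body orientation after r frames
    """
    d_hat = d_init
    for t in range(r):
        # Feed the previously predicted orientation back as input,
        # mimicking how the network is queried at runtime.
        x = torch.cat([device_feats[:, t], d_hat], dim=-1)
        d_hat = model(x)

    # MSE between the final predicted orientation and the ground truth.
    loss = F.mse_loss(d_hat, d_star)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```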
Motion Matching for VR
Motion Matching searches over an animation database for the best match for the current avatar
pose and the predicted trajectory.
To find the best match, we compute a new database containing the main features that define locomotion.
A feature vector \( \mathbf{z} \in \mathbb{R}^{27} \) is defined for each pose. This feature vector combines
two types of information: the current pose and the trajectory. When comparing feature vectors,
the former ensures no significant changes in the pose and thus smooth transitions;
the latter drives the animation towards our target trajectory. Feature vectors are defined as follows:
\begin{equation*}
\mathbf{z} = \left( \mathbf{z^v}, \mathbf{z^l}, \mathbf{z^p}, \mathbf{z^d} \right)
\label{eq:z}
\end{equation*}
where \( \mathbf{z^v}, \mathbf{z^l} \) are the current pose features and \( \mathbf{z^p}, \mathbf{z^d} \)
are the trajectory features. More precisely, \( \mathbf{z^v} \in \mathbb{R}^{9} \) are the velocities
of the feet and hip joints, \( \mathbf{z^l} \in \mathbb{R}^{6} \) are the positions of the feet joints,
\( \mathbf{z^p} \in \mathbb{R}^{6} \) and \( \mathbf{z^d} \in \mathbb{R}^{6} \) are the future 2D positions
and 2D orientations of the character \( 0.33 \), \( 0.66 \), and \( 1.00 \) seconds ahead.
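To make this layout concrete, the sketch below assembles such a 27-dimensional feature vector and queries the animation database with a brute-force nearest-neighbour search. Per-feature weights, normalisation, and any acceleration structure that a practical Motion Matching implementation would use are omitted, and all function names are illustrative.

```python
import numpy as np

def build_feature_vector(foot_l_vel, foot_r_vel, hip_vel,
                         foot_l_pos, foot_r_pos,
                         future_pos, future_dir):
    """Assemble z = (z^v, z^l, z^p, z^d) in R^27.

    foot_*_vel, hip_vel : (3,) joint velocities                    -> z^v in R^9
    foot_*_pos          : (3,) foot joint positions                -> z^l in R^6
    future_pos          : (3, 2) 2D positions at 0.33/0.66/1.00 s  -> z^p in R^6
    future_dir          : (3, 2) 2D orientations at the same times -> z^d in R^6
    """
    z_v = np.concatenate([foot_l_vel, foot_r_vel, hip_vel])
    z_l = np.concatenate([foot_l_pos, foot_r_pos])
    z_p = np.asarray(future_pos).reshape(-1)
    z_d = np.asarray(future_dir).reshape(-1)
    return np.concatenate([z_v, z_l, z_p, z_d])  # shape (27,)

def motion_matching_query(feature_db, z_query):
    """Return the index of the database pose whose features best match the query.

    feature_db : (N, 27) matrix of precomputed feature vectors
    z_query    : (27,) feature vector built from the current pose and trajectory
    """
    dists = np.linalg.norm(feature_db - z_query, axis=1)
    return int(np.argmin(dists))
```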
Final Pose Adjustments
In our work, the upper body is not considered by the Motion Matching algorithm; this keeps the
dimensionality of the feature vector low and lets us focus on lower-body locomotion, for which no tracking
data is available in consumer-grade VR.
To obtain the upper-body pose for the arms, we can use the hand controllers as end effectors
for an Inverse Kinematics algorithm. This approach is fast to compute and provides a good way
for the user to interact with the environment in VR.
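As an illustration of this step, below is a minimal analytic two-bone IK sketch for one arm, assuming the controller position is used as the end-effector target. A production solver would also resolve the elbow swivel direction, joint limits, and wrist rotation; this function is a simplified example, not the solver used in the paper.

```python
import numpy as np

def two_bone_ik_angles(shoulder_pos, target_pos, upper_len, lower_len):
    """Analytic two-bone IK: interior shoulder and elbow angles to reach a target.

    shoulder_pos : (3,) shoulder joint position
    target_pos   : (3,) hand controller position (end-effector target)
    upper_len    : upper-arm length
    lower_len    : forearm length
    """
    # Clamp the target distance to the reachable range of the two-bone chain.
    dist = np.linalg.norm(np.asarray(target_pos) - np.asarray(shoulder_pos))
    dist = np.clip(dist, 1e-4, upper_len + lower_len - 1e-4)

    # Law of cosines gives the interior angles of the shoulder-elbow-hand triangle.
    cos_elbow = (upper_len**2 + lower_len**2 - dist**2) / (2.0 * upper_len * lower_len)
    cos_shoulder = (upper_len**2 + dist**2 - lower_len**2) / (2.0 * upper_len * dist)

    elbow_angle = np.arccos(np.clip(cos_elbow, -1.0, 1.0))
    shoulder_angle = np.arccos(np.clip(cos_shoulder, -1.0, 1.0))
    return shoulder_angle, elbow_angle
```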
Overview Video
Citation
@article{ponton2022mmvr,
    journal = {Computer Graphics Forum},
    title   = {{Combining Motion Matching and Orientation Prediction to Animate Avatars for Consumer-Grade VR Devices}},
    author  = {Ponton, Jose Luis and Yun, Haoran and Andujar, Carlos and Pelechano, Nuria},
    year    = {2022},
    volume  = {41},
    number  = {8},
    pages   = {107-118},
    ISSN    = {1467-8659},
    DOI     = {10.1111/cgf.14628}
}