This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty.
Existing methods for motion forecasting or generation rely on either motion or text inputs alone, limiting precision and control over extended durations. Our multi-modal approach enhances contextual understanding, while a graph-based transformer effectively captures spatio-temporal dynamics. As a result, MDMP consistently outperforms existing methods in accurately predicting long-term motions. By leveraging the ability of diffusion models to capture distinct modes of prediction, we also estimate uncertainty, which significantly improves spatial awareness in human-robot interactions.
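Concretely, a diffusion model can be sampled several times for the same context, and the spread of the resulting futures serves as an uncertainty signal. Below is a minimal sketch of this idea; sample_motion is a hypothetical handle to a trained sampler, not the paper's API, and the paper additionally learns the reverse-process variance (see the differences from MDM below).

    import torch

    def predict_with_uncertainty(sample_motion, context, n_samples=20):
        # Draw several futures for one context; samples: (n_samples, frames, joints, 3).
        samples = torch.stack([sample_motion(context) for _ in range(n_samples)])
        mean_motion = samples.mean(dim=0)               # point prediction
        uncertainty = samples.std(dim=0).norm(dim=-1)   # per-joint spread, (frames, joints)
        return mean_motion, uncertainty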
As part of the diffusion process, MDMP progressively denoises a motion sample conditioned on an input motion through masking. Our architecture employs a GCN encoder to capture spatial joint features. We encode text prompts using CLIP followed by a linear layer; the textual embedding c and the noise time-step t are projected to the same latent dimension by separate feed-forward networks. These features, summed with a sinusoidal positional embedding, are fed into an encoder-only Transformer backbone. The backbone output is projected back to the original motion dimensions via a GCN decoder. Our model is trained both conditionally and unconditionally on text by randomly masking 10% of the text embeddings, which balances diversity and fidelity to the text during sampling.
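The denoiser's forward pass can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the layer sizes are placeholders, plain linear layers stand in for the GCN encoder/decoder (a learnable-connectivity version is sketched further below), and the MDM-style prepended conditioning token is assumed for combining the text and time-step embeddings.

    import math
    import torch
    import torch.nn as nn

    class MDMPDenoiserSketch(nn.Module):
        def __init__(self, n_joints=22, joint_dim=3, latent_dim=512,
                     clip_dim=512, n_layers=8, n_heads=4):
            super().__init__()
            # Stand-ins for the GCN encoder/decoder (the real model uses graph
            # convolutions with learnable connectivity; see the sketch below).
            self.gcn_encoder = nn.Linear(n_joints * joint_dim, latent_dim)
            self.gcn_decoder = nn.Linear(latent_dim, n_joints * joint_dim)
            # Separate feed-forward projections for the text embedding c and time-step t.
            self.text_proj = nn.Sequential(nn.Linear(clip_dim, latent_dim),
                                           nn.SiLU(), nn.Linear(latent_dim, latent_dim))
            self.time_proj = nn.Sequential(nn.Linear(latent_dim, latent_dim),
                                           nn.SiLU(), nn.Linear(latent_dim, latent_dim))
            layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=n_heads,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.latent_dim = latent_dim

        def sinusoidal(self, idx):
            # Standard sinusoidal embedding, used here for both the noise
            # time-step and the frame positions.
            half = self.latent_dim // 2
            freqs = torch.exp(-math.log(10000.0) *
                              torch.arange(half, device=idx.device) / half)
            ang = idx.float()[:, None] * freqs[None]
            return torch.cat([ang.sin(), ang.cos()], dim=-1)

        def forward(self, x_t, t, clip_emb, drop_text=False):
            # x_t: noised motion, (batch, frames, n_joints * joint_dim); t: (batch,)
            if drop_text:  # ~10% of training steps run unconditionally
                clip_emb = torch.zeros_like(clip_emb)
            cond = self.time_proj(self.sinusoidal(t)) + self.text_proj(clip_emb)
            pos = self.sinusoidal(torch.arange(x_t.shape[1], device=x_t.device))
            h = self.gcn_encoder(x_t) + pos[None]        # joint features + positions
            seq = torch.cat([cond[:, None], h], dim=1)   # prepend conditioning token
            out = self.backbone(seq)[:, 1:]              # drop the conditioning token
            return self.gcn_decoder(out)                 # back to motion dimensions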
Our method uses the building blocks of MDM, but with three key differences: (1) a denoising model that learns the reverse-process variance to increase log-likelihood and enable uncertainty estimates, (2) a GCN encoder with learnable graph connectivity (sketched below), and (3) a learning framework that incorporates contextuality by synchronizing skeletal inputs with the initial textual inputs.
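Difference (2) can be illustrated with a graph-convolution layer whose adjacency matrix is a trainable parameter, so joint connectivity is learned rather than fixed to the kinematic tree. This is a hedged sketch under assumptions: the near-identity initialization and softmax row normalization are illustrative choices, not details confirmed by the paper.

    import torch
    import torch.nn as nn

    class LearnableGCNLayer(nn.Module):
        def __init__(self, n_joints, in_dim, out_dim):
            super().__init__()
            # Learnable connectivity: initialized near the identity so each
            # joint starts by attending mostly to itself.
            self.adj = nn.Parameter(torch.eye(n_joints)
                                    + 0.01 * torch.randn(n_joints, n_joints))
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, x):
            # x: (batch, n_joints, in_dim); mix features across joints with the
            # row-normalized learned adjacency, then transform per joint.
            return torch.softmax(self.adj, dim=-1) @ self.proj(x)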
More visual results coming soon.
@misc{bringer2024mdmp,
      title={MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty},
      author={Leo Bringer and Joey Wilson and Kira Barton and Maani Ghaffari},
      year={2024},
      eprint={2410.03860},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.03860},
}