Teaser GIF

MDMP enables long-term human motion predictions with quantifiable uncertainty.

Audio Overview (generated using NotebookLM)

Listen to a high-level overview of MDMP.

Abstract

This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty.

Existing methods for motion forecasting or generation rely solely on motion or text inputs, limiting precision or control over extended durations. Our multi-modal approach enhances contextual understanding, while a graph-based transformer effectively captures spatio-temporal dynamics. Consequently, MDMP consistently outperforms existing methods in accurately predicting long-term motions. By leveraging diffusion models’ ability to capture different modes of prediction, we estimate uncertainty and significantly improve spatial awareness in human-robot interactions.

Method

Method Summary

As part of the diffusion process, MDMP progressively denoises a motion sample conditioned on an input motion through masking. Our architecture employs a GCN encoder to capture spatial joint features. We encode text prompts using CLIP followed by a linear layer; the textual embedding c and the noise time-step t are projected into the same latent space by separate feed-forward networks. These features, summed with a sinusoidal positional embedding, are fed into an encoder-only Transformer backbone. The backbone output is projected back to the original motion dimensions via a GCN decoder. The model is trained both conditionally and unconditionally on text by randomly masking 10% of the text embeddings, which balances diversity and text fidelity during sampling.
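Below is a minimal sketch of this forward pass in PyTorch. The class name MDMPSketch, the placeholder encoder/decoder layers, and the latent sizes and layer counts are illustrative assumptions, not the released implementation.

import math
import torch
import torch.nn as nn


class SinusoidalPositionalEmbedding(nn.Module):
    def __init__(self, dim: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: [T, B, D]
        return x + self.pe[: x.size(0)].unsqueeze(1)


class MDMPSketch(nn.Module):
    def __init__(self, n_joints=22, joint_dim=3, latent=512, clip_dim=512, heads=4, layers=8):
        super().__init__()
        # Stand-ins for the graph convolutional encoder/decoder over the skeleton.
        self.gcn_encode = nn.Linear(n_joints * joint_dim, latent)  # placeholder for the GCN encoder
        self.gcn_decode = nn.Linear(latent, n_joints * joint_dim)  # placeholder for the GCN decoder
        self.text_proj = nn.Linear(clip_dim, latent)                # CLIP embedding -> latent
        self.time_proj = nn.Sequential(nn.Linear(latent, latent), nn.SiLU(), nn.Linear(latent, latent))
        self.pos_emb = SinusoidalPositionalEmbedding(latent)
        enc_layer = nn.TransformerEncoderLayer(d_model=latent, nhead=heads, dim_feedforward=1024)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x_t, t_emb, clip_text, text_mask):
        # x_t: noisy motion [T, B, n_joints*joint_dim]; t_emb: [B, latent]; clip_text: [B, clip_dim]
        # text_mask: [B, 1] with 0 where the prompt is dropped (10% of samples during training).
        c = self.text_proj(clip_text) * text_mask           # masked text conditioning
        cond = (c + self.time_proj(t_emb)).unsqueeze(0)     # [1, B, latent] conditioning token
        h = self.gcn_encode(x_t)                            # spatial joint features per frame
        seq = self.pos_emb(torch.cat([cond, h], dim=0))     # prepend condition, add positions
        out = self.backbone(seq)[1:]                        # drop the conditioning token
        return self.gcn_decode(out)                         # predicted clean motion, back in joint coordinates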

Our method uses the building blocks of MDM, but with three key differences: (1) a denoising model that learns variances to increase log-likelihood and enable uncertainty estimates, (2) a GCN encoder with learnable graph connectivity, and (3) a learning framework that incorporates context by synchronizing skeletal inputs with the initial textual inputs.
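To illustrate difference (2), here is a minimal sketch of a graph convolution whose adjacency matrix is learned end-to-end, assuming PyTorch; the joint count, initialization, and row normalization are assumptions rather than the exact layer used in the paper.

import torch
import torch.nn as nn


class LearnableGraphConv(nn.Module):
    """Graph convolution whose adjacency matrix is a free parameter learned end-to-end."""

    def __init__(self, in_dim: int, out_dim: int, n_joints: int = 22):
        super().__init__()
        # Start from roughly diagonal connectivity; training reshapes it to reflect
        # which joint trajectories actually depend on each other.
        self.adj = nn.Parameter(torch.eye(n_joints) + 0.01 * torch.randn(n_joints, n_joints))
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: [batch, n_joints, in_dim]
        adj = torch.softmax(self.adj, dim=-1)      # keep each row a normalized neighborhood
        x = torch.einsum("ij,bjd->bid", adj, x)    # mix features across connected joints
        return self.weight(x)


# Usage: per-frame joint features are mixed across the learned skeleton graph.
layer = LearnableGraphConv(in_dim=3, out_dim=64)
feats = layer(torch.randn(8, 22, 3))               # -> [8, 22, 64]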

Results

More visual results coming soon.

Model Accuracy Evaluation
  • MDMP outperforms baseline Text2Motion models in accuracy, particularly over longer sequences. Unlike baselines such as MoMask, MotionGPT, and MDM, which treat the motion data as a masked input during sampling, MDMP is trained to leverage it as an additional supervision signal. This yields a significant performance improvement, demonstrated by lower MPJPE values over time (a sketch of this metric follows the list).
  • Integrating textual and skeletal data significantly enhances prediction accuracy: ablation studies (Table 1) confirm that combining both input types results in much higher prediction accuracy.
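As a concrete reference for the accuracy metric above, here is a minimal sketch of MPJPE computed per predicted frame, assuming NumPy arrays of shape [frames, joints, 3] in meters; the array names and frame rate in the example are illustrative.

import numpy as np


def mpjpe_over_time(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Mean Per-Joint Position Error for each predicted frame."""
    # Euclidean distance per joint, then average over joints for each frame.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)


pred = np.random.rand(120, 22, 3)   # e.g. ~5 s of predicted motion at 24 fps
gt = np.random.rand(120, 22, 3)
curve = mpjpe_over_time(pred, gt)   # per-frame error curve; baseline curves typically grow faster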

Uncertainty Parameter Evaluation
  • Mode Divergence proves to be the best-performing uncertainty index. It closely follows the Oracle curve, indicating strong alignment between uncertainty estimates and true errors (a sketch of this index follows the list).
  • Denoising Fluctuations and Predicted Variance are less reliable uncertainty indices: they show a general declining trend, but the effect is less pronounced, indicating weaker alignment with the true errors.
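One way to read the Mode Divergence index is as the spread between several futures sampled from the diffusion model under the same conditioning. The sketch below follows that reading, which is an assumption about the exact formula; shapes and names are illustrative.

import numpy as np


def mode_divergence(samples: np.ndarray) -> np.ndarray:
    """samples: [n_samples, frames, joints, 3] -> per-frame, per-joint divergence."""
    mean_motion = samples.mean(axis=0, keepdims=True)
    # Average distance of each sampled mode from the mean prediction.
    return np.linalg.norm(samples - mean_motion, axis=-1).mean(axis=0)  # [frames, joints]


samples = np.random.rand(10, 120, 22, 3)   # 10 sampled futures for one prompt / past motion
uncertainty = mode_divergence(samples)     # high values flag joints and frames to distrust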

Ablation Study - Motion and Text Effects
  • The study confirms the importance of multimodal fusion by demonstrating increased prediction accuracy when both input types are used.
  • The model relies more heavily on motion input sequences than textual prompts for short-term predictions.
  • Textual information is most useful for longer-term predictions where the stochasticity and variability of potential scenarios are much higher.

Ablation Study - Architectural Design and Parameter Choice
  • Learned Graph Connectivity improves the model's grasp of dependencies between human joint trajectories. Using GCNs leads to better performance than linear layers, especially for longer-term predictions.
  • Learning variances allows the model to capture more modes of the data distribution, improving accuracy over longer-term predictions (a sketch of one standard variance parameterization follows the list).
  • Reducing the number of diffusion steps significantly improves computational efficiency, which is pivotal for real-time Human-Robot Collaboration; in our experiments, this optimization also improved accuracy over time.
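The variance-learning point above is consistent with the Improved DDPM parameterization, where the network predicts an interpolation coefficient between beta_t and the posterior variance beta_tilde_t in log space; whether MDMP uses exactly this form is an assumption. A minimal sketch:

import numpy as np


def learned_variance(v: np.ndarray, betas: np.ndarray) -> np.ndarray:
    """v: network output in [0, 1] per timestep; returns the learned sigma_t^2."""
    alphas = 1.0 - betas
    alphas_cum = np.cumprod(alphas)
    alphas_cum_prev = np.append(1.0, alphas_cum[:-1])
    beta_tilde = betas * (1.0 - alphas_cum_prev) / (1.0 - alphas_cum)  # posterior variance
    # Interpolate in log space between the two variance bounds.
    log_var = v * np.log(betas) + (1.0 - v) * np.log(np.maximum(beta_tilde, 1e-20))
    return np.exp(log_var)


betas = np.linspace(1e-4, 0.02, 50)                 # a shortened schedule (fewer diffusion steps)
sigmas_sq = learned_variance(np.full(50, 0.5), betas)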

BibTeX

@misc{bringer2024mdmp,
      title={MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty}, 
      author={Leo Bringer and Joey Wilson and Kira Barton and Maani Ghaffari},
      year={2024},
      eprint={2410.03860},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.03860},
}