Teaser GIF

MDMP enables long-term human motion predictions with quantifiable uncertainty.

Audio Overview (generated using NotebookLM)

Listen to a high-level overview of MDMP.

Abstract

This paper introduces a Multi-modal Diffusion model for Motion Prediction (MDMP) that integrates and synchronizes skeletal data and textual descriptions of actions to generate refined long-term motion predictions with quantifiable uncertainty.

Existing methods for motion forecasting or generation rely solely on motion or text inputs, limiting precision or control over extended durations. Our multi-modal approach enhances contextual understanding, while a graph-based transformer effectively captures spatio-temporal dynamics. Consequently, MDMP consistently outperforms existing methods in accurately predicting long-term motions. By leveraging diffusion models’ ability to capture different modes of prediction, we estimate uncertainty and significantly improve spatial awareness in human-robot interactions.

Method

Method Summary

As part of the diffusion process, MDMP progressively denoises a motion sample conditioned on an input motion through masking. Our architecture employs a GCN encoder to capture spatial joint features. We encode text prompts using CLIP followed by a linear layer; the textual embedding c and the noise time-step t are projected into the same latent space by separate feed-forward networks. These features, summed with a sinusoidal positional embedding, are fed into an encoder-only Transformer backbone. The backbone output is projected back to the original motion dimensions via a GCN decoder. The model is trained both conditionally and unconditionally on text by randomly masking 10% of the text embeddings, which balances diversity and text fidelity during sampling.
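Below is a minimal sketch of this forward pass in PyTorch. The class name MDMPSketch, the placeholder encoder/decoder layers, and the latent sizes and layer counts are illustrative assumptions, not the released implementation.

import math
import torch
import torch.nn as nn


class SinusoidalPositionalEmbedding(nn.Module):
    def __init__(self, dim: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: [T, B, D]
        return x + self.pe[: x.size(0)].unsqueeze(1)


class MDMPSketch(nn.Module):
    def __init__(self, n_joints=22, joint_dim=3, latent=512, clip_dim=512, heads=4, layers=8):
        super().__init__()
        # Stand-ins for the graph convolutional encoder/decoder over the skeleton.
        self.gcn_encode = nn.Linear(n_joints * joint_dim, latent)  # placeholder for the GCN encoder
        self.gcn_decode = nn.Linear(latent, n_joints * joint_dim)  # placeholder for the GCN decoder
        self.text_proj = nn.Linear(clip_dim, latent)                # CLIP embedding -> latent
        self.time_proj = nn.Sequential(nn.Linear(latent, latent), nn.SiLU(), nn.Linear(latent, latent))
        self.pos_emb = SinusoidalPositionalEmbedding(latent)
        enc_layer = nn.TransformerEncoderLayer(d_model=latent, nhead=heads, dim_feedforward=1024)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x_t, t_emb, clip_text, text_mask):
        # x_t: noisy motion [T, B, n_joints*joint_dim]; t_emb: [B, latent]; clip_text: [B, clip_dim]
        # text_mask: [B, 1] with 0 where the prompt is dropped (10% of samples during training).
        c = self.text_proj(clip_text) * text_mask           # masked text conditioning
        cond = (c + self.time_proj(t_emb)).unsqueeze(0)     # [1, B, latent] conditioning token
        h = self.gcn_encode(x_t)                            # spatial joint features per frame
        seq = self.pos_emb(torch.cat([cond, h], dim=0))     # prepend condition, add positions
        out = self.backbone(seq)[1:]                        # drop the conditioning token
        return self.gcn_decode(out)                         # predicted clean motion, back in joint coordinates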

Our method uses the building blocks of MDM, but with three key differences: (1) a denoising model that learns variances to increase log-likelihood and enable uncertainty estimates, (2) a GCN encoder with learnable graph connectivity, and (3) a learning framework that incorporates context by synchronizing skeletal inputs with the initial textual inputs.
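To illustrate difference (2), here is a minimal sketch of a graph convolution whose adjacency matrix is learned end-to-end, assuming PyTorch; the joint count, initialization, and row normalization are assumptions rather than the exact layer used in the paper.

import torch
import torch.nn as nn


class LearnableGraphConv(nn.Module):
    """Graph convolution whose adjacency matrix is a free parameter learned end-to-end."""

    def __init__(self, in_dim: int, out_dim: int, n_joints: int = 22):
        super().__init__()
        # Start from roughly diagonal connectivity; training reshapes it to reflect
        # which joint trajectories actually depend on each other.
        self.adj = nn.Parameter(torch.eye(n_joints) + 0.01 * torch.randn(n_joints, n_joints))
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: [batch, n_joints, in_dim]
        adj = torch.softmax(self.adj, dim=-1)      # keep each row a normalized neighborhood
        x = torch.einsum("ij,bjd->bid", adj, x)    # mix features across connected joints
        return self.weight(x)


# Usage: per-frame joint features are mixed across the learned skeleton graph.
layer = LearnableGraphConv(in_dim=3, out_dim=64)
feats = layer(torch.randn(8, 22, 3))               # -> [8, 22, 64]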

Results

More visual results coming soon.

Model Accuracy Evaluation
  • MDMP outperforms baseline Text2Motion models in accuracy, particularly over longer sequences. Unlike baselines such as MoMask, MotionGPT, and MDM, which treat the motion data as a masked input during sampling, MDMP is trained to leverage it as an additional supervision signal. This yields a significant performance improvement, demonstrated by lower MPJPE values over time (a sketch of this metric follows the list).
  • Integrating textual and skeletal data significantly enhances prediction accuracy: ablation studies (Table 1) confirm that combining both input types results in much higher prediction accuracy.
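As a concrete reference for the accuracy metric above, here is a minimal sketch of MPJPE computed per predicted frame, assuming NumPy arrays of shape [frames, joints, 3] in meters; the array names and frame rate in the example are illustrative.

import numpy as np


def mpjpe_over_time(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Mean Per-Joint Position Error for each predicted frame."""
    # Euclidean distance per joint, then average over joints for each frame.
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)


pred = np.random.rand(120, 22, 3)   # e.g. ~5 s of predicted motion at 24 fps
gt = np.random.rand(120, 22, 3)
curve = mpjpe_over_time(pred, gt)   # per-frame error curve; baseline curves typically grow faster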

Uncertainty Parameter Evaluation
  • Mode Divergence proves to be the best-performing uncertainty index. It closely follows the Oracle curve, indicating strong alignment between uncertainty estimates and true errors (a sketch of this index follows the list).
  • Denoising Fluctuations and Predicted Variance are less reliable uncertainty indices: they show a general declining trend, but the effect is less pronounced, indicating weaker alignment with the true errors.
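One way to read the Mode Divergence index is as the spread between several futures sampled from the diffusion model under the same conditioning. The sketch below follows that reading, which is an assumption about the exact formula; shapes and names are illustrative.

import numpy as np


def mode_divergence(samples: np.ndarray) -> np.ndarray:
    """samples: [n_samples, frames, joints, 3] -> per-frame, per-joint divergence."""
    mean_motion = samples.mean(axis=0, keepdims=True)
    # Average distance of each sampled mode from the mean prediction.
    return np.linalg.norm(samples - mean_motion, axis=-1).mean(axis=0)  # [frames, joints]


samples = np.random.rand(10, 120, 22, 3)   # 10 sampled futures for one prompt / past motion
uncertainty = mode_divergence(samples)     # high values flag joints and frames to distrust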

Ablation Study - Motion and Text Effects
  • The study confirms the importance of multimodal fusion by demonstrating increased prediction accuracy when both input types are used.
  • The model relies more heavily on motion input sequences than textual prompts for short-term predictions.
  • Textual information is most useful for longer-term predictions where the stochasticity and variability of potential scenarios are much higher.

Ablation Study - Architectural Design and Parameter Choice
  • Learned Graph Connectivity improves the model's grasp of dependencies between human joint trajectories. Using GCNs leads to better performance than linear layers, especially for longer-term predictions.
  • Learning variances allows the model to capture more modes of the data distribution, improving accuracy over longer-term predictions (a sketch of one standard variance parameterization follows the list).
  • Reducing the number of diffusion steps significantly improves computational efficiency, which is pivotal for real-time Human-Robot Collaboration; in our experiments, this optimization also improved accuracy over time.
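The variance-learning point above is consistent with the Improved DDPM parameterization, where the network predicts an interpolation coefficient between beta_t and the posterior variance beta_tilde_t in log space; whether MDMP uses exactly this form is an assumption. A minimal sketch:

import numpy as np


def learned_variance(v: np.ndarray, betas: np.ndarray) -> np.ndarray:
    """v: network output in [0, 1] per timestep; returns the learned sigma_t^2."""
    alphas = 1.0 - betas
    alphas_cum = np.cumprod(alphas)
    alphas_cum_prev = np.append(1.0, alphas_cum[:-1])
    beta_tilde = betas * (1.0 - alphas_cum_prev) / (1.0 - alphas_cum)  # posterior variance
    # Interpolate in log space between the two variance bounds.
    log_var = v * np.log(betas) + (1.0 - v) * np.log(np.maximum(beta_tilde, 1e-20))
    return np.exp(log_var)


betas = np.linspace(1e-4, 0.02, 50)                 # a shortened schedule (fewer diffusion steps)
sigmas_sq = learned_variance(np.full(50, 0.5), betas)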

BibTeX

@misc{bringer2024mdmp,
      title={MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty}, 
      author={Leo Bringer and Joey Wilson and Kira Barton and Maani Ghaffari},
      year={2024},
      eprint={2410.03860},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.03860},
}