Technical Roadmap

We will divide the development of the A2A model into several phases, taking an iterative approach that refines the model over time.

Phase 1: Basic Body Motion Generation (Genre Specific Audio-to-Dance)

  • Technology: In this phase, the model focuses on generating full-body motions, such as dance movements, from the audio input, trained on a specific curated dataset.

  • Approach: This consists of using techniques like audio feature extraction to analyze the audio input and interpret the rhythm, beats per minute, and musical cues of the audio to choreograph specific movements. The model outputs animation keyframe poses and synthesizes movements that align with the audio’s tempo and style. For instance, the amplitude of the audio signal could determine the extent of movement, while the frequency content might influence the speed or style of motion. Dance motions are chosen as the first focus for model training due to their wide range of defined motions and the clear delineation of the audio’s genre and style. Contributions to the reference audio-animation library, which increase the number of animations used for validation, are also encouraged at this juncture.
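As a rough illustration of the kind of feature extraction involved, the sketch below pulls tempo, beat positions, and an amplitude envelope from an audio clip and exposes them as motion cues. It assumes the librosa library; the specific features and the cue-to-motion mapping are illustrative choices, not the final pipeline.

```python
# Illustrative sketch only: extract rhythm and energy cues from an audio clip
# with librosa and expose them as simple motion parameters. The library choice
# and the cue-to-motion mapping are assumptions, not the production pipeline.
import librosa
import numpy as np

def extract_motion_cues(audio_path: str) -> dict:
    y, sr = librosa.load(audio_path)

    # Rhythm: estimated tempo (BPM) and beat positions in seconds.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Amplitude envelope: per-frame RMS energy, normalised to [0, 1].
    rms = librosa.feature.rms(y=y)[0]
    amplitude = rms / (rms.max() + 1e-8)

    # Spectral centroid as a rough proxy for frequency content ("brightness").
    brightness = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

    return {
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),  # paces the keyframes
        "beat_times": beat_times,                     # keyframes placed on beats
        "amplitude": amplitude,                       # could scale movement extent
        "brightness": brightness,                     # could bias speed/style
    }
```

In practice these hand-picked cues would feed a learned generative model rather than a hand-written mapping; the sketch only shows the kind of signal the model conditions on.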

Phase 2: Mood and Genre Recognition (General Audio-to-Dance)

  • Technology: In this phase, the model should be able to recognise the mood and genre of the audio input and adjust the generated animation accordingly. This enhances the model’s generative capabilities, allowing it to output a diverse range of styles and improving the realism and expressiveness of the generated animations.

  • Approach: The model learns to identify patterns in the audio features associated with different moods (e.g., happy, sad, energetic) and genres (e.g., kpop, jazz, tap-dance, contemporary); a simple classification sketch follows the list below. A few other improvements are also proposed to enhance the output quality:

    • A data-first approach will be taken, collecting audio-dance data across different genres

    • Infrastructure and tooling for automatic data collection, such as pose estimation from videos, will be set up to scale data-collection capabilities (see the pose-estimation sketch after this list)

    • Validation and evaluation will evolve alongside the model, with the potential of using other generative models such as Vision-Language Models (VLMs) to provide reference trajectories
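To make the mood and genre recognition step concrete, the sketch below trains a simple classifier on mean-pooled MFCC features using librosa and scikit-learn. The feature set, classifier, and label list are assumptions for illustration; the actual Phase 2 model would more likely learn these representations jointly with the animation generator.

```python
# Illustrative sketch, not the actual Phase 2 model: a simple genre classifier
# over mean-pooled MFCC features. Feature choice, classifier, and label set
# are assumptions for illustration only.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

GENRES = ["kpop", "jazz", "tap-dance", "contemporary"]  # example label set

def mfcc_features(audio_path: str) -> np.ndarray:
    """Mean and std of 20 MFCCs -> a fixed-length descriptor per clip."""
    y, sr = librosa.load(audio_path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_genre_classifier(audio_paths, labels):
    """labels are indices into GENRES; returns a fitted classifier."""
    X = np.stack([mfcc_features(p) for p in audio_paths])
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    clf.fit(X, labels)
    return clf
```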
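For the data-collection tooling, the sketch below shows one possible way to extract per-frame pose landmarks from dance videos using MediaPipe and OpenCV. The library choice and output format are assumptions; a production pipeline would add tracking, filtering, and retargeting of the landmarks onto the animation skeleton.

```python
# Illustrative sketch of pose estimation from video for data collection,
# assuming MediaPipe Pose and OpenCV; not the actual tooling.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_pose_sequence(video_path: str):
    """Return a list of per-frame (x, y, z) landmark tuples from a video."""
    poses = []
    cap = cv2.VideoCapture(video_path)
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames as BGR.
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_world_landmarks:
                poses.append([(lm.x, lm.y, lm.z)
                              for lm in result.pose_world_landmarks.landmark])
    cap.release()
    return poses
```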

Phase 3: Beyond Audio-to-Dance (General Audio-to-Animation)

  • Technology: This phase aims to develop a general-purpose A2A model, further expanding the model’s capabilities beyond dance audio and motions to inputs such as regular speech and interaction audio for interactive gestures.

  • Approach: Language modelling and the GPT approach that has proven successful for text-based generative models will be used. Obtaining a diverse, high-quality set of curated audio-animation data is the focus at this stage, ranging from dance and gestures to potentially even text-guided animation.
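A minimal sketch of the GPT-style formulation is shown below, under the assumption that audio and animation are first quantised into discrete tokens by separate codec/VQ models (not shown here). The architecture, vocabulary sizes, and names are illustrative only.

```python
# Minimal sketch of a GPT-style audio-to-motion model: audio tokens form the
# prefix, motion tokens are predicted autoregressively. Sizes and names are
# illustrative assumptions, not the final design.
import torch
import torch.nn as nn

class AudioToMotionGPT(nn.Module):
    def __init__(self, audio_vocab=1024, motion_vocab=512, d_model=512,
                 n_heads=8, n_layers=6, max_len=2048):
        super().__init__()
        self.audio_emb = nn.Embedding(audio_vocab, d_model)
        self.motion_emb = nn.Embedding(motion_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, motion_vocab)

    def forward(self, audio_tokens, motion_tokens):
        # One sequence: audio context followed by right-shifted motion tokens.
        x = torch.cat([self.audio_emb(audio_tokens),
                       self.motion_emb(motion_tokens)], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)
        # Logits for the next motion token at each motion position.
        return self.head(h[:, audio_tokens.size(1):])
```

Training would then reduce to next-motion-token prediction with cross-entropy, mirroring how text language models are trained.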

Phase 4: Audio-to-Video

  • Technology: This phase aims to expand the General Audio-to-Animation technology into Audio-to-Video, beyond just character movements. This opens up more possibilities for creative motion output in movies, films, media, and more.

  • Approach: Explore the expansion of the technology into the Audio-to-Video domain, facilitating a seamless transition from individual character animations to immersive scene creation. This would also involve the fluid flow of scenes, enriching storytelling across various media and expanding creative expression.
