Miner / Validator Modules

Miner Module

Miners will be given a module for generating character animation from audio prompts received from validators. They can use the reference A2A models to produce coherent motions that animate a virtual character's body parts in sync with the audio, or contribute by training and using their own A2A models. The quality of the generated animation directly impacts the rewards they earn, as it is compared against a reference output for the audio prompt.
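At a high level, the miner's job reduces to a function that turns an incoming audio prompt into motion data serialized as BVH (the format required by the subnet; see the evaluation protocol below). The following is a minimal sketch in which a stub stands in for the reference A2A model; the real request/response interface is defined in the miner module linked below.

```python
# Minimal sketch of the miner's task: map an audio prompt to BVH motion data.
# The model here is a stand-in stub; miners would plug in the reference A2A
# model or their own trained model instead.

from pathlib import Path


class StubA2AModel:
    """Placeholder for a reference or self-trained Audio-to-Animation model."""

    def infer(self, audio_path: Path) -> list:
        # A real model produces per-frame joint channels driven by the audio;
        # this stub just returns 30 frames of a constant pose.
        return [[0.0, 0.0, 90.0]] * 30


def frames_to_bvh_motion(frames: list, frame_time: float = 1 / 30) -> str:
    """Serialize frames into a (heavily simplified) BVH MOTION section."""
    lines = ["MOTION", f"Frames: {len(frames)}", f"Frame Time: {frame_time:.6f}"]
    lines += [" ".join(f"{v:.4f}" for v in frame) for frame in frames]
    return "\n".join(lines)


if __name__ == "__main__":
    model = StubA2AModel()
    frames = model.infer(Path("prompts/sample.wav"))  # hypothetical prompt path
    Path("sample_output.bvh").write_text(frames_to_bvh_motion(frames))
```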

The subnet owner also advises miners on how to improve the quality of their output beyond the reference models provided, in the following (non-exhaustive) ways:

  • Use their own motion-capture data, including curating data from public video sources, to enrich the model

  • Apply another generative model to produce additional synthetic motion data (see the sketch after this list)

  • Any other improvements that could enhance the output quality
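As a loose illustration of the second suggestion, simple transformations of existing motion clips (tempo changes, small perturbations) can synthetically expand a training set. This is only a toy sketch under those assumptions, not the subnet's data pipeline.

```python
# Toy sketch of synthetic motion augmentation: expanding a motion dataset by
# perturbing existing frame sequences. Transform choices and names are
# illustrative only, not part of the subnet code.

import numpy as np


def time_stretch(frames: np.ndarray, factor: float) -> np.ndarray:
    """Resample a (num_frames, num_channels) motion clip to a new tempo."""
    old_idx = np.arange(len(frames))
    new_idx = np.linspace(0, len(frames) - 1, int(len(frames) * factor))
    channels = [np.interp(new_idx, old_idx, frames[:, c]) for c in range(frames.shape[1])]
    return np.stack(channels, axis=1)


def jitter(frames: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """Add small Gaussian noise to joint channels to create a motion variant."""
    return frames + np.random.normal(0.0, scale, size=frames.shape)


def augment(frames: np.ndarray) -> list:
    """Produce a few synthetic variants from one captured motion clip."""
    return [time_stretch(frames, 0.9), time_stretch(frames, 1.1), jitter(frames)]


if __name__ == "__main__":
    clip = np.random.rand(120, 6)  # stand-in for 120 frames of mocap channels
    print([variant.shape for variant in augment(clip)])
```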

Link to Miner Module on Github: https://github.com/Virtual-Protocol/tao-vpsubnet/blob/main/docs/miner.md

Validator Module

Validators will also be given a module for providing audio prompts to the miners and evaluating the quality of the generated animations the miners submit. The prompts can be chosen from a library of audio files provided by the subnet owner or (in the future) synthesized by other subnets involved in the creation of Text-to-Music or Text-to-Speech models (e.g., subnet 18 by Cortex.t).

After evaluating the quality of the generated animations, validators submit their opinions to the subtensor network via APIs. Yuma Consensus then aggregates the opinions from all validators and determines the rewards of miners (based on their outputs) and validators (based on how they rank the miners).
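A hedged sketch of that flow is below: each miner's output is scored against the reference animation, scores are normalized into weights, and the weights are submitted on-chain. The `score_animation` helper and the response structure are illustrative assumptions; the actual logic lives in the validator module linked below.

```python
# Hedged sketch of the validator flow: score each miner's animation against the
# reference, normalize the scores into weights, and submit them to the network.
# Helper names and the response structure are illustrative placeholders.

import numpy as np


def score_animation(generated: np.ndarray, reference: np.ndarray) -> float:
    """Higher is better: invert the frame-wise RMSE against the reference."""
    n = min(len(generated), len(reference))
    rmse = float(np.sqrt(np.mean((generated[:n] - reference[:n]) ** 2)))
    return 1.0 / (1.0 + rmse)


def compute_weights(responses: dict, reference: np.ndarray) -> dict:
    """Map miner UID -> normalized weight derived from response quality."""
    scores = {uid: score_animation(frames, reference) for uid, frames in responses.items()}
    total = sum(scores.values()) or 1.0
    return {uid: s / total for uid, s in scores.items()}


if __name__ == "__main__":
    reference = np.random.rand(60, 9)                            # stand-in reference animation
    responses = {1: reference + 0.05, 2: np.random.rand(60, 9)}  # fake miner outputs
    weights = compute_weights(responses, reference)
    print(weights)
    # These normalized weights would then be submitted to the subtensor network
    # (e.g. via the bittensor SDK), where Yuma Consensus aggregates all
    # validators' opinions into miner and validator rewards.
```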

Link to Validator Module on Github: https://github.com/Virtual-Protocol/tao-vpsubnet/blob/main/docs/validator.md

Evaluation Protocol

One challenge we anticipate in assessing animation quality is its inherent non-determinism and subjectivity, particularly when dealing with synthetically generated audio prompts. Without a definitive standard for how a character should move to a specific audio file, evaluating animation quality becomes subjective and complex, especially with AI-generated music.

To establish an objective evaluation protocol for the subnet, we've carefully designed a process that fosters innovation without imposing undue barriers. Miners are tasked with submitting character animations corresponding to audio prompts, and their output is compared against reference animations considered the "source of truth."

Achieving a strong evaluation protocol involves a phased approach. Initially, we will collect data to build a library of audio-prompt-animation pairs in which trained dancers perform to specific audio prompts. This dataset captures the movement of key body parts at every music interval and serves as the baseline for character motion; these recordings are used as the reference animations. Moving forward, we will also run human-pose-estimation models on videos to increase the diversity and size of this evaluation dataset of audio-prompt-animation pairs.
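For concreteness, each entry of that library can be thought of as an audio prompt paired with a reference animation plus provenance metadata. The field names below are assumptions for illustration, not the subnet's actual schema.

```python
# Illustrative shape of one evaluation record: an audio prompt paired with its
# "source of truth" animation. Field names are assumptions, not the real schema.

from dataclasses import dataclass


@dataclass
class ReferencePair:
    audio_path: str      # audio prompt sent to miners
    reference_bvh: str   # reference animation for that prompt
    source: str          # e.g. "mocap" (trained dancer) or "pose-estimation" (video)
    fps: float = 30.0    # capture rate of the reference motion


library = [
    ReferencePair("audio/prompt_001.wav", "bvh/prompt_001.bvh", source="mocap"),
    ReferencePair("audio/prompt_002.wav", "bvh/prompt_002.bvh", source="pose-estimation"),
]
```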

In this subnet, motion data must be submitted in the common standard BVH format, matching the motion dataset format used in the reference library.
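Because submissions are exchanged as BVH, a reader that pulls the per-frame channel values out of the MOTION section is enough to prepare an animation for comparison. The sketch below is simplified (it skips the HIERARCHY section and assumes a well-formed file); production code should use a proper BVH parser.

```python
# Simplified BVH reader: extract per-frame channel values from the MOTION
# section into a (num_frames, num_channels) array. Assumes a well-formed file
# and ignores the skeleton HIERARCHY; use a full BVH parser in practice.

import numpy as np


def load_bvh_frames(path: str) -> np.ndarray:
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    start = lines.index("MOTION")
    # The two lines after MOTION are the "Frames:" and "Frame Time:" headers.
    frame_lines = lines[start + 3:]
    return np.array([[float(value) for value in line.split()] for line in frame_lines])
```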

For evaluation, multiple parameters can be considered. In this phase, we measure the similarity of the generated animation to the reference animation using the Root Mean Square Error (RMSE) across frames. In the next phase, additional metrics such as the Physical Foot Contact (PFC) score can be used to assess the 'realism' and physical feasibility of the output animations, motion-to-beat synchronization, and more.
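Concretely, the frame-wise RMSE can be computed over the joint channels of the generated and reference motion after trimming both clips to a common length. The sketch below shows that comparison (reusing the simplified BVH reader above with hypothetical file names); it is not the subnet's exact scoring code.

```python
# Sketch of the phase-one similarity metric: RMSE between generated and
# reference motion across frames and joint channels (lower is better).

import numpy as np


def animation_rmse(generated: np.ndarray, reference: np.ndarray) -> float:
    """Both arrays are (num_frames, num_channels); trim to the shorter clip."""
    n = min(len(generated), len(reference))
    diff = generated[:n] - reference[:n]
    return float(np.sqrt(np.mean(diff ** 2)))


# Example usage with the simplified BVH reader above (hypothetical file names):
# rmse = animation_rmse(load_bvh_frames("miner_output.bvh"),
#                       load_bvh_frames("reference.bvh"))
```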

Over the medium-to-long time horizon, we will also explore leveraging existing proprietary or more capable Audio-to-Animation models, such as those from Nvidia and Adobe, to generate reference animations that could be considered the "source of truth." This can be done by feeding the same audio prompts to these models and recording their character movements as reference animations. Comparing these outputs with our own models' results allows us to iteratively refine our reference coordinates, striving for alignment with established industry standards, in a manner similar to AI feedback and synthetic data generation.

It's crucial to note that the reference animations are dynamic and subject to continuous improvement over time, ensuring ongoing relevance and accuracy in evaluating animation quality. This iterative process enables the subnet to maintain a robust evaluation framework while accommodating the evolving nature of audio-driven animation technology.

Lastly, as Vision-Language Models (VLMs) that can take video as input mature, the evaluation framework can incorporate them to provide human-like feedback on dance quality and creativity, given that animation is inherently subjective to evaluate.
