Voice Core


A VIRTUAL Agent is designed to have a distinct voice that aligns with its personality and role. Training the voice models is therefore a critical process, ensuring that each character's voice is not only realistic but also consistent with its designed persona.

There are two modules used in Voice Core.

Speech-to-text (STT) module: The STT module is trained on a wide range of voice data, allowing it to accurately transcribe various accents, dialects, and speech patterns, which makes it versatile and reliable across different user scenarios.
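
The whitepaper does not name a specific STT model, so the snippet below is only a sketch of how a preprocessed clip could be transcribed, using the open-source Whisper model through the Hugging Face transformers pipeline as a stand-in; the model name and file path are assumptions.

from transformers import pipeline

# Illustrative stand-in for Voice Core's (unspecified) STT module.
asr = pipeline('automatic-speech-recognition', model='openai/whisper-small')

# Transcribe one of the preprocessed 22050 Hz mono WAV files (hypothetical path).
result = asr('out/sample_utterance.wav')
print(result['text'])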

Text-to-speech (TTS) module: For the TTS module, we use VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) training. VITS is known for producing high-quality, natural-sounding speech, which is particularly important for our platform: each AI character requires a specific voice that matches its unique personality and characteristics, and the VITS model allows for this level of customization and quality in voice synthesis.
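
The exact VITS training setup is not described here, so the following is a minimal inference sketch using the open-source Coqui TTS library as a stand-in; the pretrained model name, sample text, and output path are assumptions, and a production agent would instead load a VITS voice trained for its own persona.

from TTS.api import TTS

# Illustrative only: a generic pretrained single-speaker VITS model.
# Each VIRTUAL Agent would use a VITS voice fine-tuned to its persona.
tts = TTS(model_name='tts_models/en/ljspeech/vits', progress_bar=False)

# Synthesize a line of dialogue to a WAV file (hypothetical text and path).
tts.tts_to_file(text='Welcome back. What shall we do today?', file_path='agent_line.wav')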

Before the models are trained, the audio data is preprocessed.

Techniques used for data preprocessing

  1. Format Consistency: Having all audio files in the same format (WAV) and specifications (22050 Hz, mono) ensures consistency, which is essential for machine learning models to perform optimally. Inconsistent audio formats can lead to variability in the input data, which can confuse the model and degrade performance.

  2. Sampling Rate Normalization (22050 Hz): The sampling rate determines how many samples per second are in the audio file. A standard rate such as 22050 Hz is often used because it is sufficient to capture the frequency range of human speech while keeping file sizes manageable. By the Nyquist theorem, it captures all frequencies up to 11025 Hz, which covers the range most important for speech.

  3. Mono Channel: Converting stereo or multi-channel audio files to mono ensures that the model trains on a single channel, which simplifies the learning process.

Sample Code
import os
from pydub import AudioSegment

upload_dir = 'upload_dir'
output_dir = 'out'

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

extensions = ['.wav', '.mp3', '.ogg']  # include the leading dot so only real file extensions match

# Process all files in the upload directory
for filename in os.listdir(upload_dir):
    if any(filename.lower().endswith(ext) for ext in extensions):
        # Construct file paths
        file_path = os.path.join(upload_dir, filename)
        output_path = os.path.join(output_dir, os.path.splitext(filename)[0] + '.wav')

        # Load the audio file
        audio = AudioSegment.from_file(file_path)

        # Convert to WAV, 22050 Hz, mono
        audio = audio.set_frame_rate(22050).set_channels(1)

        # Export the processed audio
        audio.export(output_path, format='wav')
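
As a quick sanity check (not part of the original pipeline), the exported files can be reloaded with pydub to confirm they meet the 22050 Hz mono WAV specification:

import os
from pydub import AudioSegment

output_dir = 'out'

# Spot-check: every exported file should be 22050 Hz, mono.
for filename in os.listdir(output_dir):
    audio = AudioSegment.from_wav(os.path.join(output_dir, filename))
    assert audio.frame_rate == 22050 and audio.channels == 1, filename

print('All files are 22050 Hz mono WAV.')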

Learn more about contributing to Voice Core.