Voice Core
Each VIRTUAL Agent is designed to have a distinct voice that aligns with its personality and role. Training the voice models is therefore a critical process: each character's voice must be not only realistic but also consistent with its designed persona.
Voice Core consists of two modules.
Speech-to-text module: The STT module is trained on a wide range of voice data, which allows it to accurately transcribe varied accents, dialects, and speech patterns, making it versatile and reliable across different user scenarios.
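The documentation does not name a specific STT backend. As a minimal sketch of accent-robust transcription, assuming the open-source openai-whisper package and a hypothetical input file:

```python
# STT sketch using OpenAI's open-source Whisper model. This is an
# illustrative stand-in, not the platform's actual STT module;
# "user_query.wav" is a hypothetical input file.
import whisper

model = whisper.load_model("base")           # small multilingual model
result = model.transcribe("user_query.wav")  # returns text plus segments
print(result["text"])
```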
Text-to-speech module: For the TTS module, we train Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) models. VITS is known for producing high-quality, natural-sounding speech. This is particularly important for our platform, since each AI character requires a specific voice that matches its unique personality and characteristics; VITS provides this level of customization and quality in voice synthesis.
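As a sketch of what VITS inference looks like, here is an example using the open-source Coqui TTS library and one of its public pretrained VITS checkpoints. The model name and file paths are illustrative assumptions, not the platform's own character voices:

```python
# VITS inference sketch via Coqui TTS. The LJSpeech checkpoint is a
# public example; a production character voice would instead be a
# custom VITS model trained on that character's voice data.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(
    text="Hello, I am your VIRTUAL agent.",
    file_path="agent_voice.wav",
)
```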
Before the models are trained, the data is preprocessed.
Techniques used for data preprocessing
Format Consistency: All audio files share the same format (WAV) and specifications (22050 Hz, mono). Consistent inputs are essential for machine learning models to perform optimally; inconsistent formats introduce variability that can confuse the model and degrade performance (see the preprocessing sketch after this list).
Sampling Rate Normalization (22050 Hz): The sampling rate determines how many samples per second an audio file contains. 22050 Hz is a common standard because it captures the frequency range of human speech while keeping file sizes manageable. By the Nyquist theorem it preserves all frequencies up to 11025 Hz, which spans the band that carries most speech information.
Mono Channel: Converting stereo or multi-channel audio files to mono ensures that the model trains on a single channel, which simplifies the learning process.
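The three steps above can be combined into a single normalization pass. A minimal sketch, assuming the librosa and soundfile libraries (common choices, not necessarily the tools used here) and hypothetical file paths:

```python
# Preprocessing sketch: convert any input clip to WAV, 22050 Hz, mono.
import librosa
import soundfile as sf

TARGET_SR = 22050  # Nyquist limit of 11025 Hz covers the speech band

def preprocess_clip(in_path: str, out_path: str) -> None:
    # sr=TARGET_SR resamples on load; mono=True averages the channels
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")  # 16-bit WAV

preprocess_clip("raw/clip_001.mp3", "processed/clip_001.wav")
```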