Introduction#
This article provides a brief introduction to vc-lm.
vc-lm is a voice converter that can transform anyone's voice into thousands of different voices.
Content#
1. What is vc-lm?#
This project is a voice converter that can transform anyone's voice into thousands of different voices. It uses encodec to discretize audio into tokens and builds a transformer language model on top of those tokens. The project consists of two-stage models: an AR (autoregressive) model and a NAR (non-autoregressive) model. Because parallel training pairs can be constructed in a self-supervised manner, the project can generate a large amount of one-to-any parallel data, which is then used to train an any-to-one voice conversion model. Roughly 10 minutes of target speaker audio is enough to achieve good conversion results.
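For context, the discretization step looks roughly like this with the open-source encodec package (the input file name is a placeholder; at 6 kbps the 24 kHz model yields 8 parallel codebooks at 75 token frames per second):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pre-trained 24 kHz EnCodec model and pick a bandwidth; 6 kbps
# corresponds to 8 residual codebooks.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load('sample.wav')  # placeholder file name
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
# Concatenate the per-frame codes into one tensor of shape (batch, n_q, T).
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # e.g. torch.Size([1, 8, 750]) for a 10-second clip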
2. vc-lm Algorithm Architecture#
Following the VALL-E paper, the project uses the encodec algorithm to discretize audio into tokens and builds a transformer language model on top of them. Decoding proceeds in two stages: the AR model autoregressively generates the tokens of the first encodec codebook, and the NAR model then predicts the remaining codebook layers, one full layer per parallel pass. A whisper encoder (extracted in the steps below) provides the content representation of the input audio, while a short style clip supplies the target voice.
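To make the two stages concrete, here is a purely illustrative sketch of this VALL-E-style decoding; ar_model and nar_model are hypothetical callables standing in for the real networks, not vc-lm's actual classes.

import torch

def two_stage_decode(ar_model, nar_model, content_feats, style_tokens,
                     num_codebooks=8, eos_id=1024, max_len=1500):
    # Stage 1 (AR): generate codebook-0 tokens one at a time until EOS.
    layer0 = []
    while len(layer0) < max_len:
        logits = ar_model(content_feats, style_tokens, layer0)  # next-token logits
        token = int(torch.argmax(logits, dim=-1))
        if token == eos_id:
            break
        layer0.append(token)
    layers = [torch.tensor(layer0)]
    # Stage 2 (NAR): predict codebooks 1..num_codebooks-1, each layer in a
    # single parallel pass, conditioned on all layers generated so far.
    for k in range(1, num_codebooks):
        layer_k = nar_model(content_feats, style_tokens,
                            torch.stack(layers), layer_index=k)  # shape (T,)
        layers.append(layer_k)
    return torch.stack(layers)  # (num_codebooks, T); decode back with encodec

The AR stage fixes the overall timing and prosody of the output, while the cheaper parallel NAR passes restore the remaining audio detail.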
3. vc-lm Usage and Training#
Training vc-lm involves two steps: first pre-train the one-to-any model, then use it to construct parallel data and train the any-to-one conversion model.
Pre-training:
1. Use tools/construct_wavs_file.py to process the source wav files into segments of 10 to 24 seconds (a rough sketch of this segmentation step follows the training commands below).
2. Use tools/construct_dataset.py to construct the dataset.
3. Extract the encoder module from whisper:
python tools/extract_whisper_encoder_model.py --input_model=../whisper/medium.pt --output_model=../whisper-encoder/medium-encoder.pt
4. Adjust the relevant storage paths in the configuration files, then train the AR and NAR models separately:
python run.py fit --config configs/ar_model.yaml
python run.py fit --config configs/nar_model.yaml
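As a rough illustration of what step 1 produces, the sketch below slices a long recording into fixed 20-second chunks (inside the 10 to 24 second range) and drops short remainders; tools/construct_wavs_file.py is the authoritative implementation, and the file names and chunk length here are assumptions.

import soundfile as sf

CHUNK_SECONDS = 20  # assumed fixed length within the 10-24 s range
MIN_SECONDS = 10

wav, sr = sf.read('long_recording.wav')  # placeholder input file
chunk = CHUNK_SECONDS * sr
for i in range(0, len(wav), chunk):
    piece = wav[i:i + chunk]
    # Keep only chunks that are at least 10 seconds long.
    if len(piece) >= MIN_SECONDS * sr:
        sf.write(f'segments/part_{i // chunk:04d}.wav', piece, sr)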
After training, perform inference testing:
import librosa
from vc_lm.vc_engine import VCEngine

# content_wav is the speech to convert; style_wav is a clip of the target
# voice. The 24 kHz rate is an assumption -- check the project's configs.
content_wav, _ = librosa.load('content.wav', sr=24000)
style_wav, _ = librosa.load('style.wav', sr=24000)

engine = VCEngine('/pathto/vc-models/ar.ckpt',
                  '/pathto/vc-models/nar.ckpt',
                  '../configs/ar_model.json',
                  '../configs/nar_model.json')
output_wav = engine.process_audio(content_wav, style_wav, max_style_len=3, use_ar=True)
Any-to-One Training:
1. Construct the target speaker dataset, in the same way as before.
2. Construct the any-to-one parallel data (the pre-trained model pairs re-styled versions of the target speaker's recordings with the originals; a rough sketch follows these steps):
python tools/construct_parallel_dataset.py
3. Load the pre-trained models and fine-tune on the target dataset:
python run.py fit --config configs/finetune_ar_model.yaml
python run.py fit --config configs/finetune_nar_model.yaml
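Purely as an illustration of what the parallel-data step does (tools/construct_parallel_dataset.py is the project's actual implementation; the paths, file layout, and 24 kHz rate below are assumptions), the idea is to re-render each target-speaker utterance in other voices and pair the result with the original:

import glob
import random

import librosa
import soundfile as sf
from vc_lm.vc_engine import VCEngine

engine = VCEngine('/pathto/vc-models/ar.ckpt', '/pathto/vc-models/nar.ckpt',
                  '../configs/ar_model.json', '../configs/nar_model.json')

target_files = glob.glob('target_speaker/*.wav')  # the "one" side
style_files = glob.glob('other_speakers/*.wav')   # many other voices

for i, path in enumerate(target_files):
    content_wav, _ = librosa.load(path, sr=24000)
    style_wav, _ = librosa.load(random.choice(style_files), sr=24000)
    # Re-render the target utterance in another voice; the pair
    # (re-styled input -> original target audio) trains the any-to-one model.
    styled = engine.process_audio(content_wav, style_wav,
                                  max_style_len=3, use_ar=True)
    sf.write(f'parallel/{i:05d}_input.wav', styled, 24000)
    sf.write(f'parallel/{i:05d}_target.wav', content_wav, 24000)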
Perform inference testing:
from vc_lm.vc_engine import VCEngine

# Same usage as before, now pointing at the fine-tuned any-to-one checkpoints.
engine = VCEngine('/pathto/jr-ar.ckpt',
                  '/pathto/jr-nar.ckpt',
                  '../configs/ar_model.json',
                  '../configs/nar_model.json')
output_wav = engine.process_audio(content_wav, style_wav, max_style_len=3, use_ar=True)
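To listen to the result, the returned waveform can be written to disk; the 24 kHz sample rate here is an assumption to verify against the engine's configuration.

import soundfile as sf

sf.write('converted.wav', output_wav, 24000)  # sample rate assumed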
4. Conclusion#
Pre-trained models are available for download from the project's official repository.
It is important to note that this project is different from TTS: vc-lm focuses on voice conversion, and it achieves strong results after fine-tuning on only a small amount of target speaker data.
Finally#
References:
VALL-E: C. Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers", arXiv:2301.02111, 2023.
EnCodec: A. Défossez et al., "High Fidelity Neural Audio Compression", arXiv:2210.13438, 2022.
Disclaimer#
This article is for personal learning purposes only.
This article is synchronized with HBlog.