
vc-lm: An Any-to-One Timbre Converter

Introduction#

This article provides a brief introduction to vc-lm.

vc-lm is a voice converter that can transform anyone's voice into thousands of different voices.


Content#

1. What is vc-lm?#

This project is a voice converter that can transform anyone's voice into thousands of different voices. It uses encodec to discretize audio into tokens and builds a transformer language model on top of those tokens. The pipeline consists of two stages: an AR (autoregressive) model and a NAR (non-autoregressive) model. Because the model can be trained in a self-supervised manner, it can generate a large amount of one-to-any parallel data, which in turn is used to train an any-to-one voice conversion model. Fine-tuning on as little as 10 minutes of target-speaker data already achieves good results.

2. vc-lm Algorithm Architecture#

Following the Vall-E paper, the project uses the encodec algorithm to discretize audio into tokens and builds a transformer language model on top of them. The pipeline consists of two stages: the AR model and the NAR model.
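
At inference time the two stages work like Vall-E's: the AR model autoregressively predicts the first encodec codebook level, and the NAR model then fills in the remaining levels in parallel. Below is a minimal sketch of that flow; the function names, shapes, and EOS handling are illustrative assumptions, not the project's actual API.

import torch

def two_stage_inference(ar_model, nar_model, content_feats, style_tokens,
                        num_levels=8, eos_id=1024, max_len=1500):
    """Illustrative Vall-E-style two-stage decoding over encodec tokens."""
    # Stage 1 (AR): predict level-0 tokens one step at a time,
    # prompted with the style clip's level-0 tokens.
    level0 = style_tokens[0].tolist()
    prompt_len = len(level0)
    for _ in range(max_len):            # hard cap so the sketch always terminates
        logits = ar_model(content_feats, torch.tensor(level0))
        next_tok = int(logits[-1].argmax())
        if next_tok == eos_id:          # stop at end-of-sequence
            break
        level0.append(next_tok)
    generated = [torch.tensor(level0[prompt_len:])]

    # Stage 2 (NAR): predict levels 1..num_levels-1 in parallel,
    # each conditioned on all previously decoded levels.
    for level in range(1, num_levels):
        decoded = torch.stack(generated)                 # (level, T_gen)
        logits = nar_model(content_feats, style_tokens, decoded, level)
        generated.append(logits.argmax(dim=-1))

    # (num_levels, T_gen) token grid, ready for the encodec decoder.
    return torch.stack(generated)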

3. vc-lm Usage and Training#

Training vc-lm is divided into two steps: first pre-train the one-to-any model in a self-supervised manner and use it to generate parallel data, then use that data to train the any-to-one conversion model.

Pre-training:

Refer to tools/construct_wavs_file.py to split the source wav files into clips of 10 to 24 seconds, as sketched below.
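
The splitting itself is straightforward; here is a hedged sketch of the idea using soundfile, where the chunking policy (fixed-size cuts, short remainders dropped) is an assumption rather than the script's exact behavior.

import soundfile as sf

def split_wav(path, out_prefix, min_sec=10, max_sec=24):
    """Split a long wav into clips between min_sec and max_sec long."""
    audio, sr = sf.read(path)
    chunk = max_sec * sr
    pieces = []
    for i in range(0, len(audio), chunk):
        seg = audio[i:i + chunk]
        if len(seg) >= min_sec * sr:    # drop trailing clips shorter than min_sec
            out_path = f"{out_prefix}_{i // chunk:04d}.wav"
            sf.write(out_path, seg, sr)
            pieces.append(out_path)
    return pieces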

Refer to tools/construct_dataset.py to construct the dataset.

Extract the encoder module from whisper:

python tools/extract_whisper_encoder_model.py --input_model=../whisper/medium.pt --output_model=../whisper-encoder/medium-encoder.pt
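
Conceptually, this step loads the full Whisper checkpoint and saves the encoder weights alone. A rough sketch, assuming the standard OpenAI Whisper checkpoint layout ('dims' plus 'model_state_dict'); the real script may differ in details:

import torch

def extract_encoder(input_model: str, output_model: str):
    """Keep only the encoder weights from a Whisper checkpoint."""
    ckpt = torch.load(input_model, map_location="cpu")
    state = ckpt["model_state_dict"]    # OpenAI Whisper checkpoint layout (assumed)
    encoder_state = {k: v for k, v in state.items() if k.startswith("encoder.")}
    torch.save({"dims": ckpt["dims"], "model_state_dict": encoder_state}, output_model)

extract_encoder("../whisper/medium.pt", "../whisper-encoder/medium-encoder.pt")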

Adjust the relevant storage paths in the configuration files, and then train the AR and NAR models separately:

python run.py fit --config configs/ar_model.yaml
python run.py fit --config configs/nar_model.yaml
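
The fit --config interface suggests a PyTorch Lightning CLI setup. For orientation only, a config in that style typically looks like the sketch below; every key and value here is an invented placeholder, not the contents of the project's actual configs/ar_model.yaml.

# Illustrative Lightning-CLI-style config (placeholders, not the real file).
trainer:
  max_epochs: 100
  accelerator: gpu
  devices: 1
model:
  class_path: vc_lm.models.ARModel             # hypothetical class path
  init_args:
    lr: 1.0e-4
data:
  class_path: vc_lm.datamodules.VCDataModule   # hypothetical class path
  init_args:
    data_dir: /path/to/dataset
    batch_size: 8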

After training, perform inference testing:

from vc_lm.vc_engine import VCEngine

# Load the trained AR and NAR checkpoints together with their configs.
engine = VCEngine('/pathto/vc-models/ar.ckpt',
                  '/pathto/vc-models/nar.ckpt',
                  '../configs/ar_model.json',
                  '../configs/nar_model.json')
# Convert content_wav into the timbre of style_wav.
output_wav = engine.process_audio(content_wav, style_wav, max_style_len=3, use_ar=True)
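
Here content_wav and style_wav are assumed to be waveform arrays already in memory. A hedged preparation example, assuming the engine consumes 24 kHz mono float waveforms (encodec's native rate); check the project for the exact expected format:

import librosa
import soundfile as sf

# Assumption: mono float waveforms at 24 kHz (encodec's sample rate).
content_wav, _ = librosa.load('content.wav', sr=24000, mono=True)  # speech to convert
style_wav, _ = librosa.load('style.wav', sr=24000, mono=True)      # timbre prompt
output_wav = engine.process_audio(content_wav, style_wav, max_style_len=3, use_ar=True)
sf.write('converted.wav', output_wav, 24000)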

Any-to-One Training:

Construct the target dataset, similar to the previous method.

Construct the Any-to-One parallel data with python tools/construct_parallel_dataset.py; a sketch of the idea follows.
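
The trick is to reuse the pre-trained one-to-any model: each target-speaker clip is converted into many other voices, and each (converted clip, original clip) pair becomes an (any-voice input, target-voice output) training example. A sketch of that loop, with illustrative names only:

import random

def build_parallel_pairs(engine, target_wavs, style_wavs, n_styles=10):
    """Turn each target-speaker clip into many (any-voice, target-voice) pairs."""
    pairs = []
    for target_wav in target_wavs:
        for style_wav in random.sample(style_wavs, n_styles):
            # Keep the target's content, borrow another speaker's timbre.
            converted = engine.process_audio(target_wav, style_wav,
                                             max_style_len=3, use_ar=True)
            # Input: converted clip (arbitrary voice); label: original target clip.
            pairs.append((converted, target_wav))
    return pairs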

Load the pre-trained models and train on the target dataset:

python run.py fit --config configs/finetune_ar_model.yaml
python run.py fit --config configs/finetune_nar_model.yaml

Perform inference testing:

from vc_lm.vc_engine import VCEngine

# Load the fine-tuned (any-to-one) AR and NAR checkpoints.
engine = VCEngine('/pathto/jr-ar.ckpt',
                  '/pathto/jr-nar.ckpt',
                  '../configs/ar_model.json',
                  '../configs/nar_model.json')
# Any input voice is now converted to the target speaker's timbre.
output_wav = engine.process_audio(content_wav, style_wav, max_style_len=3, use_ar=True)

4. Conclusion#

Pre-trained models are available for download from the official project.

It is important to note that this project differs from TTS: vc-lm focuses on voice conversion, and fine-tuning on only a small amount of target data is enough to achieve good conversion results.


Finally#

References:

Official Project

Vall-E

encodec


Disclaimer#

This article is for personal learning purposes only.

This article is synchronized with HBlog.
