
vits_chinese - Fluent and Clear Chinese TTS

Preface#

This article briefly introduces vits_chinese.

vits_chinese combines BERT and VITS for TTS, incorporating natural-language features to achieve high-quality text-to-speech conversion while also supporting real-time output.


Main Content#

1. What is VITS#

VITS is an end-to-end speech synthesis model that uses variational inference and adversarial training to generate more natural-sounding audio than typical two-stage TTS systems. A stochastic duration predictor lets it synthesize speech with varied rhythms from the same input text, and together with probabilistic modeling it captures the natural one-to-many relationship in which a given text can be spoken with different tones and rhythms.
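To make the one-to-many idea concrete, here is a toy Python sketch (not the actual VITS module) in which each phoneme's duration is sampled from a distribution, so the same text is realized with a different rhythm on every run:

import numpy as np

# Toy stochastic duration predictor: each phoneme has a mean log-duration,
# and sampling adds noise so the same text gets a different rhythm each run.
phonemes = ["n", "i", "h", "ao"]               # hypothetical phoneme sequence
mean_log_dur = np.array([2.0, 2.5, 1.8, 3.0])  # hypothetical model outputs

rng = np.random.default_rng()
for trial in range(3):
    noise = rng.normal(scale=0.3, size=len(phonemes))
    durations = np.round(np.exp(mean_log_dur + noise)).astype(int)  # in frames
    print(f"trial {trial}:", dict(zip(phonemes, durations)))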

The following diagram shows the VITS training procedure:

[Figure: VITS training procedure (12-vits_train)]

The following diagram shows the VITS inference procedure:

[Figure: VITS inference procedure (12-vits_infer)]

vits_chinese uses VITS as the model framework and BERT as a front-end component, which yields more natural pauses, fewer pronunciation errors, and high audio quality.

2. Features of vits_chinese#

  1. Uses BERT to obtain natural pauses, producing more natural-sounding speech (see the sketch after this list).
  2. Applies an inference loss inspired by NaturalSpeech to reduce pronunciation errors.
  3. Builds on the VITS framework to provide high-quality audio.
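As an illustration of the first point, here is a minimal sketch of extracting per-token BERT hidden states to use as prosody features, using the Hugging Face transformers library and the public bert-base-chinese checkpoint; the exact model and wiring in vits_chinese may differ:

import torch
from transformers import BertModel, BertTokenizer

# Minimal sketch: per-token BERT hidden states as prosody features.
# vits_chinese has its own BERT setup; this only illustrates the idea.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "你好，世界。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)

# One 768-dim vector per token; such features can be fed to the TTS
# front end alongside phonemes to inform pausing and prosody.
print(hidden.shape)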

3. Usage and Training of vits_chinese#

You can experience the demo online.

If you want to train and test it yourself, first install the project's dependencies and build the monotonic alignment extension:

# install Python dependencies (run from the repository root)
pip install -r requirements.txt
# build the monotonic alignment search extension in place
cd monotonic_align
python setup.py build_ext --inplace
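Assuming the project follows the upstream VITS layout, where the extension exposes a maximum_path function, you can sanity-check the build from the repository root:

python -c "from monotonic_align import maximum_path; print('monotonic_align OK')"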

Pretrained models can be downloaded from the project's Hugging Face page or from the shared drive link.

For inference, run:

python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth

To train on the Baker dataset:

  1. Download the Baker data and resample the waveforms to 16 kHz (see the sketch below).
  2. Put the resampled waveforms into ./data/waves and put 000001-010000.txt into ./data.
  3. Prepare the data: python vits_prepare.py -c ./configs/bert_vits.json
  4. Train: python train.py -c configs/bert_vits.json -m bert_vits
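For step 1, a minimal resampling sketch using librosa and soundfile (src_waves is a hypothetical input directory; adjust the paths to your layout):

import os

import librosa
import soundfile as sf

# Resample Baker waveforms to 16 kHz for training.
# "src_waves" is a hypothetical input directory; adjust to your layout.
src_dir = "src_waves"
dst_dir = "./data/waves"
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.endswith(".wav"):
        continue
    audio, _ = librosa.load(os.path.join(src_dir, name), sr=16000)  # resamples
    sf.write(os.path.join(dst_dir, name), audio, 16000)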

If you want to train on your own data, you only need to format it according to the project's requirements.
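For reference, each entry in the Baker label file pairs an utterance ID and prosody-annotated text (#1 through #4 mark break levels) with a pinyin line; if I recall the format correctly, it looks roughly like this (check the dataset for the exact layout):

000001	卡尔普#2陪外孙#1玩滑梯#4。
	ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1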

4. Conclusion#

The quality of the results depends entirely on the coverage of the training corpus, so this project should not be considered a complete, general-purpose TTS system.

The project ships with only one female voice; if you want more options, you need to collect and train on your own corpus. Overall, though, the results are very good.


Finally#

Reference articles:

Official Project

VITS


Disclaimer#

This article is for personal learning purposes only.

This article is synchronized with hblog.
