VITS-fast-fine-tuning - Quickly Clone Custom Character Voices

Introduction

This article provides a brief introduction to VITS-fast-fine-tuning.

VITS-fast-fine-tuning is a fine-tuning training library for VITS that allows for the rapid cloning of desired character voices.

It supports the following features:

- Voice conversion between any two characters in the model
- Text-to-speech (TTS) for custom character voices in Chinese, Japanese, and English
- Several fine-tuning methods:
  - Cloning character voices from more than 10 short audio clips
  - Cloning character voices from audio clips longer than 3 minutes (each audio clip must contain only a single speaker)
  - Cloning character voices from videos longer than 3 minutes (each video must contain only a single speaker)
  - Cloning character voices from a Bilibili video link
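
Before uploading data, it can help to check a folder of clips against the thresholds listed above. The following is only a rough sketch and not part of the project: it assumes the clips are uncompressed PCM .wav files, and it treats any clip no longer than 3 minutes as "short", which is an assumption because no exact limit for short clips is given here. The function names `wav_duration_seconds` and `check_clips` are hypothetical.

```python
import wave
from pathlib import Path

MIN_SHORT_CLIPS = 10        # "more than 10 short audio clips"
MIN_LONG_SECONDS = 3 * 60   # "longer than 3 minutes"


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_clips(folder, long_threshold=MIN_LONG_SECONDS):
    """Measure every .wav file in `folder` and report which criteria are met.

    Assumption: a clip counts as "short" when it is not longer than
    `long_threshold` seconds.
    """
    durations = {
        p.name: wav_duration_seconds(p)
        for p in sorted(Path(folder).glob("*.wav"))
    }
    long_files = [name for name, d in durations.items() if d > long_threshold]
    short_files = [name for name, d in durations.items() if d <= long_threshold]
    return {
        "short_clips_ok": len(short_files) > MIN_SHORT_CLIPS,
        "long_audio_ok": len(long_files) > 0,
        "durations": durations,
    }
```

Calling `check_clips("my_clips")` on a hypothetical folder returns a dictionary whose two flags indicate whether the short-clip route or the long-audio route looks viable. The check does not verify that each file contains only one speaker.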

Prepare the data
Train online using Google Colab
Alternatively, train locally by following the tutorial. Local training requires installing the CUDA dependencies and downloading the project code and pre-trained models, so it is more involved than training on Colab.

Download the fine-tuned model and config files.
Download the latest release package (on the right side of the GitHub page).
Place the downloaded model and config files in the inference folder, named G_latest.pth and finetune_speaker.json, respectively.
Once everything is ready, the file structure should look as follows:

inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth
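
As a quick sanity check of the layout above, a short script can confirm that the named files are in place before launching the program. This is a minimal sketch, not something shipped with the project: the helper name `missing_files` is hypothetical, and it only checks the three files named in the tree, not whatever the `...` entry stands for.

```python
from pathlib import Path

# File names taken from the directory tree above.
REQUIRED_FILES = ("inference.exe", "finetune_speaker.json", "G_latest.pth")


def missing_files(inference_dir):
    """Return the required file names that are absent from `inference_dir`."""
    root = Path(inference_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from `missing_files("inference")` means the three files are present. It does not validate their contents.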

Run inference.exe. A browser window will open automatically. Note that the path to inference.exe must not contain any Chinese characters or spaces.
Please note that the voice conversion feature requires the installation of ffmpeg to function properly.
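
Both requirements above can be checked up front. The sketch below is illustrative only and uses hypothetical helper names (`path_problems`, `ffmpeg_available`). It flags every non-ASCII character, which is stricter than the stated restriction on Chinese characters, and it assumes that ffmpeg being "installed" means the `ffmpeg` executable can be found on the system PATH.

```python
import shutil


def path_problems(path):
    """List the reasons why `path` may be unsuitable for inference.exe."""
    text = str(path)
    problems = []
    if " " in text:
        problems.append("contains spaces")
    # Assumption: rejecting all non-ASCII characters also covers Chinese ones.
    if not text.isascii():
        problems.append("contains non-ASCII characters")
    return problems


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None
```

For example, `path_problems(Path.cwd())` (with `Path` imported from `pathlib`) reports issues with the current folder, and `ffmpeg_available()` indicates whether voice conversion is likely to work.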

This project simplifies the process of fine-tuning custom character voices and provides a packaged program for easy use with pre-trained models.


This article is for personal learning purposes only.

This article is synchronized with HBlog.