bark - Generate highly realistic multilingual speech and sound effects

Introduction#

This article briefly introduces bark.

bark is a text-prompted audio generation model that produces high-quality synthesized speech. You provide the text and choose the voice and sound effects you want, and it generates audio that matches.

Main Content#

1. What is bark#

bark

2. bark Features#

Bark can generate highly realistic multilingual speech and other audio, including music, background noise, and simple sound effects. It can also generate non-verbal communication, such as laughter, sighs, and crying.

3. Using bark#

If the script's automatic download is slow, manually download the model parameters from the official model_zoo. Models with the suffix _2 are the large models, and those without the suffix are the small models.
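When the weights are downloaded by hand, it helps to confirm that all of them are in place before loading. The sketch below is only an illustration: it assumes the three model stages are stored as text, coarse, and fine checkpoints with a .pt extension, and that the loader looks in a directory such as ~/.cache/suno/bark_v0. Both the file names and the directory are assumptions, so check them against the version of the project you installed. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed stage names; verify against the files listed in the model_zoo.
STAGES = ("text", "coarse", "fine")


def checkpoint_names(large=True):
    """Return the expected file names: '_2' suffix for large models, none for small."""
    suffix = "_2" if large else ""
    return [f"{stage}{suffix}.pt" for stage in STAGES]


def missing_checkpoints(model_dir, large=True):
    """List the expected checkpoint files that are not present in model_dir."""
    model_dir = Path(model_dir).expanduser()
    return [name for name in checkpoint_names(large) if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed cache location; adjust to wherever your install looks for weights.
    print(missing_checkpoints("~/.cache/suno/bark_v0", large=True))
```

An empty list means every expected file was found. Anything it prints still has to be downloaded and copied into that directory.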

After installing the project files, check your environment. Torch 2.0+ is recommended, although the project also runs smoothly on version 1.12. If you already have a Torch version lower than 2.0 installed, installing the project will automatically pull in the latest Torchaudio, which does not match the older Torch and may cause the project to fail to run. In that case, manually install the Torchaudio version that corresponds to your Torch version.
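A quick way to spot a mismatched pair is to compare the two version strings. The sketch below relies on the release pairing convention that Torch 1.x goes with Torchaudio 0.x of the same minor number (for example 1.12 with 0.12), and that from 2.0 onward the major and minor numbers are identical. The helper names are hypothetical and are not part of either library.

```python
def expected_torchaudio_prefix(torch_version):
    """Return the 'major.minor' Torchaudio release that pairs with a Torch version."""
    # Drop any local build tag such as '+cu117'.
    major, minor = torch_version.split("+")[0].split(".")[:2]
    # Torch 1.x pairs with Torchaudio 0.x; from 2.0 on the numbers match.
    audio_major = "0" if major == "1" else major
    return f"{audio_major}.{minor}"


def is_matching_pair(torch_version, torchaudio_version):
    """True if the Torchaudio version belongs to the same release as the Torch version."""
    found = torchaudio_version.split("+")[0].split(".")[:2]
    return found == expected_torchaudio_prefix(torch_version).split(".")


if __name__ == "__main__":
    try:
        import torch
        import torchaudio
    except ImportError:
        print("torch or torchaudio is not installed")
    else:
        ok = is_matching_pair(torch.__version__, torchaudio.__version__)
        print(torch.__version__, torchaudio.__version__, "OK" if ok else "MISMATCH")
```

If the script reports a mismatch, reinstall Torchaudio pinned to the release that the first helper returns for your Torch version.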

After installation, you can use the following code to run the test:

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
#from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
# text_prompt = """
#      Hello, my name is Suno. And, uh — and I like pizza. [laughs] 
#      But I also have other interests such as playing tic tac toe.
# """
text_prompt = """
     [MAN]Hello everyone, I am an artificial intelligence 250, nice to meet you! [clears throat] 
     [WOMAN]Just kidding, actually I am a tom CAT who has been practicing for two and a half days.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
  
# play audio in notebook
#Audio(audio_array, rate=SAMPLE_RATE)
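The script above passes the generated array straight to scipy, which writes a floating-point array as a floating-point WAV file, a format some players handle poorly. As an alternative, here is a minimal sketch that saves 16-bit PCM using only the standard library. It assumes the samples are mono floats in the range -1.0 to 1.0. The function write_pcm16 is a hypothetical helper, not part of bark.

```python
import struct
import wave


def write_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip each sample to the valid range, then scale to the int16 range.
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        # WAV stores samples as little-endian signed integers.
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))
```

With the variables from the script above, the call would be write_pcm16("bark_generation_pcm16.wav", audio_array, SAMPLE_RATE).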

The resulting audio files are like this:

bark_generation1.wav

Like this:

bark_generation2.wav

And like this:

bark_generation3.wav

4. Conclusion#

To be honest, the results are good. The difference between bark and traditional TTS is that traditional TTS is faithful to the input and converts the text to audio exactly, whereas bark generates audio and may improvise or alter the content on its own. It is better not to use it on formal occasions.

Bark can automatically detect the language from the text and generate audio accordingly. It also supports many sound effects as well as music generation. English works best, while generated Chinese has a foreign accent 😂. Fixing that would require fine-tuning on more Chinese training data.

Finally#

Reference article:

Official Project

Disclaimer#

This article is only for personal learning purposes.

This article is synchronized with hblog.