BiSinger: Bilingual Singing Voice Synthesis

¹Wuhan University, ²Duke Kunshan University
*Equal contribution
Corresponding author: ming.li369@duke.edu

Abstract

Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with established singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance.

Overview

Overview of BiSinger

Demo Samples

1. Chinese Voice Synthesis

Lyrics: 同住地球村

Phonemes: T UH NG JH UW D IY Q IY UH T S UW AH N

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)

Lyrics: 谁说站在光里的才算英雄

Phonemes: SH EY SH UW AO JH AE N Z AY G UW AE NG L IY D ER T S AY S UW AE N IY IY NG X IY UH NG

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)
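The phoneme lines in the two Chinese samples above are consistent with a simple initial-plus-final lookup from pinyin into CMU-style symbols. The sketch below reconstructs such a lookup from these two samples only. The tables `INITIALS` and `FINALS`, the helper names, and the pinyin romanization of the lyrics are assumptions made for illustration; they are not the authors' full mapping rules. Symbols such as `Q` and `X` are not part of the CMU phoneme set, so the samples appear to keep them for Mandarin sounds without a close English counterpart.

```python
# Tables inferred ONLY from the two Chinese demo samples on this page.
# They are illustrative assumptions, not the authors' complete rule set.
INITIALS = {
    "zh": ["JH"], "sh": ["SH"], "t": ["T"], "d": ["D"], "q": ["Q"],
    "c": ["T", "S"], "z": ["Z"], "g": ["G"], "l": ["L"], "s": ["S"],
    "x": ["X"], "y": ["IY"],
}
FINALS = {
    "ong": ["UH", "NG"], "u": ["UW"], "i": ["IY"], "iu": ["IY", "UH"],
    "un": ["UW", "AH", "N"], "ei": ["EY"], "uo": ["UW", "AO"],
    "an": ["AE", "N"], "ai": ["AY"], "uang": ["UW", "AE", "NG"],
    "e": ["ER"], "uan": ["UW", "AE", "N"], "ing": ["IY", "NG"],
    "iong": ["IY", "UH", "NG"],
}


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    # Try longer initials first so that "zh" wins over "z".
    for initial in sorted(INITIALS, key=len, reverse=True):
        final = syllable[len(initial):]
        if syllable.startswith(initial) and final in FINALS:
            return initial, final
    if syllable in FINALS:
        return "", syllable
    raise KeyError(f"no mapping inferred for syllable: {syllable!r}")


def pinyin_to_phonemes(syllables):
    """Map a list of pinyin syllables to a space-separated phoneme string."""
    phonemes = []
    for syllable in syllables:
        initial, final = split_syllable(syllable)
        phonemes.extend(INITIALS.get(initial, []))
        phonemes.extend(FINALS[final])
    return " ".join(phonemes)


# Assumed romanization of the first sample's lyrics (同住地球村).
print(pinyin_to_phonemes(["tong", "zhu", "di", "qiu", "cun"]))
```

Syllables that do not occur in the two samples raise a `KeyError` rather than guessing a mapping, since the page does not show the rules for them.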

2. English Voice Synthesis

Lyrics: cry on my shoulder

Phonemes: K R AY AA N M AY SH OW L D ER

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)

Lyrics: I wanna follow where she goes

Phonemes: AY W AA N AH F AA L OW W EH R SH IY G OW Z

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)
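The English phoneme lines above look like CMU dictionary pronunciations with the stress digits removed. As a minimal sketch under that assumption, the snippet below parses a few dictionary-style lines and strips the stress markers. The inline excerpt `CMUDICT_EXCERPT` and the function names are hypothetical stand-ins; a real pipeline would load the full dictionary file, and the entries shown should be checked against it.

```python
import re

# Small excerpt in CMU dictionary format (WORD, then phonemes with stress
# digits). Included inline for illustration; verify against the real file.
CMUDICT_EXCERPT = """\
CRY  K R AY1
ON  AA1 N
MY  M AY1
SHOULDER  SH OW1 L D ER0
"""


def load_cmudict(text):
    """Parse dictionary lines into {word: [phonemes without stress digits]}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        # Keep the first pronunciation seen for each word.
        entries.setdefault(word.lower(), [re.sub(r"\d", "", p) for p in phones])
    return entries


def lyrics_to_phonemes(lyrics, entries):
    """Look up each word and join all phonemes into one string."""
    return " ".join(p for word in lyrics.lower().split() for p in entries[word])


entries = load_cmudict(CMUDICT_EXCERPT)
print(lyrics_to_phonemes("cry on my shoulder", entries))
```

Words with several dictionary pronunciations need a selection rule; this sketch simply keeps the first one it encounters.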

3. Code-Switch Voice Synthesis

Lyrics: 开始倒数 three two one I am always online

Phonemes: K AY SH IY D AW SH UW TH R IY T UW W AH N AY AE M AO L W EY Z AO N L AY N

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)

Lyrics: 故事细腻 romantic mystery 神秘

Phonemes: G UW SH IY X IY N IY R OW M AE N T IH K M IH S T ER IY SH AH N M IY

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)
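Code-switched lyrics like the two above need to be split by language before each part can be converted to phonemes. The sketch below shows one possible way to do this, assuming that Chinese characters fall in the basic CJK Unified Ideographs range and that English words consist of ASCII letters and apostrophes. The pattern and function names are hypothetical and are not taken from the BiSinger implementation.

```python
import re
from itertools import groupby

# One token per Chinese character, or per run of ASCII letters/apostrophes.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def segment_lyrics(lyrics):
    """Return a list of (language, token) pairs in lyric order."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(lyrics):
        text = match.group()
        lang = "en" if text.isascii() else "zh"
        tokens.append((lang, text))
    return tokens


def group_runs(tokens):
    """Merge consecutive tokens of the same language into runs."""
    return [
        (lang, [text for _, text in run])
        for lang, run in groupby(tokens, key=lambda pair: pair[0])
    ]


runs = group_runs(segment_lyrics("故事细腻 romantic mystery 神秘"))
for lang, run in runs:
    print(lang, run)
```

Each run could then be passed to the matching front end: a pinyin-based lookup for the `zh` runs and a dictionary lookup for the `en` runs.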

4. Case Study

Individual phonemes can remain isolated rather than connecting coherently into a word. Frequent phoneme substitutions may also make the synthesized voice sound unclear and similar to Chinese.

(1) Bad Substitution Example:

Lyrics: I'm in love with the shape of you

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)

(2) Good Substitution Example:

Lyrics: you know you love me

Audio samples: Ground Truth (Converted) | BiSinger (NUS+DB4) | BiSinger (NUS) | BiSinger (DB4) | DiffSinger (M4-CMU-AVG)


Dataset Adaptation Demo

1. Timbre Conversion Method

Singing data info: NUS#MPOL#20

Lyrics: I will be right here waiting for you

Audio samples: Original Singer | Target Singer (S) | Target Singer (A) | Target Singer (T) | Target Singer (B)

Singing data info: NUS#ZHIY#02

Lyrics: Far, a long, long way to run

Audio samples: Original Singer | Target Singer (S) | Target Singer (A) | Target Singer (T) | Target Singer (B)

Reference Chinese songs from M4Singer: Target Singer (S) | Target Singer (A) | Target Singer (T) | Target Singer (B)

2. Pitch Shift Method

Speech data info: DB4#CN#000044

Scripts: 他们脱掉笨重的冬衣走起路来腰杆挺直步履轻盈

Audio samples: Original Utterance | Pitch-Shifted Utterance

Speech data info: DB4#EN#300003

Scripts: When I found out about her death I was shock but not surprised she said

Audio samples: Original Utterance | Pitch-Shifted Utterance

Ethics Statement

We are committed to responsible research and development practices. While our SVS system, BiSinger, holds immense potential, we are fully aware of the ethical concerns it raises, such as voice identification spoofing and impersonation. We assume user consent for speech synthesis and advocate for a protocol to ensure speaker approval and the inclusion of synthesized speech detection models when expanding to unseen speakers. By publishing our work, we aim to raise awareness, promote transparency, and mitigate the risks associated with misuse, emphasizing responsible usage and ethical standards in voice synthesis technology.

BibTeX

@inproceedings{zhou2023bisinger,
  title={BiSinger: Bilingual singing voice synthesis},
  author={Zhou, Huali and Lin, Yueqian and Shi, Yao and Sun, Peng and Li, Ming},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}