SceneAdapter: Diving Your Voice Into Multiple Real-Life Scenes With A New Multi-Scene Dataset

Qian Yang1,*, Jialong Zuo1,*, Zhe Su3,*, Ziyue Jiang1, Mingze Li1, Zhou Zhao1,
Feiyang Chen2, Zhefeng Wang2, Baoxing Huai2,
1Zhejiang University, 2Huawei Cloud, 3Carnegie Mellon University
*Equal Contribution

Abstract

Current adaptive text-to-speech (TTS) systems can synthesize a high-quality voice for any user. However, transferring this tailored voice to different real-life scenarios with varied prosody remains a great challenge. To address this, we propose SceneAdapter, an adaptive TTS framework that leverages reference prosody speech and a prompting mechanism to model scene-specific prosody and generate speech with diverse prosody. Along with the model, we have curated a multi-scene dataset (MS-Voice) featuring prosody-rich recordings from multiple real-life scenarios. We first pre-train our model on a 400-hour mixed bilingual (Chinese and English) dataset, employing a masked-prediction mechanism to model basic timbre and prosody representations. After that, we fine-tune the model on the proposed multi-scene dataset to transfer scene-relevant prosody. Experimental results indicate that our model can effectively synthesize speech characterized by both user-specific timbre and scene-specific prosody for arbitrary text input. Audio samples can be found on this demo page.

Model Overview


The overall architecture of SceneAdapter. Duration, pitch, and energy are extracted from the prompt (during training, the unmasked part; during inference, the reference speech) and serve as conditions for their respective predictors. Losses are computed only on the masked part.
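The masked-prediction objective described above can be sketched as follows. This is a minimal illustration in plain NumPy with hypothetical shapes, not the authors' implementation: frame-level prosody targets (e.g. pitch or energy) are predicted everywhere, but only the masked frames contribute to the loss, since the unmasked frames serve as the prompt.

```python
import numpy as np

def masked_prosody_loss(predicted, target, mask):
    """MSE between predicted and target frame-level prosody values
    (e.g. pitch or energy), computed only on masked frames.

    predicted, target: (num_frames,) float arrays
    mask: (num_frames,) boolean array, True where the frame is masked
    """
    squared_error = (predicted - target) ** 2
    # Unmasked frames are visible to the model as the prompt, so they
    # are excluded from the loss.
    return squared_error[mask].mean()

# Toy example: 6 frames, the last 3 masked, each off by 0.5.
target = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
predicted = np.array([1.0, 2.0, 3.0, 4.5, 5.5, 6.5])
mask = np.array([False, False, False, True, True, True])
loss = masked_prosody_loss(predicted, target, mask)  # → 0.25
```

In the full model this per-attribute loss would be summed over the duration, pitch, and energy predictors; the scalar version here just shows where the mask enters.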

MS-Voice Dataset

We introduce MS-Voice, a high-quality, monolingual Mandarin multi-domain speech dataset. The recorded texts are highly correlated with their scenes, and the audio is recorded by professional voice actors in a quiet environment, so the dataset is designed to capture the nuances of expressive speech across multiple scenarios. It includes four distinct scenes (Chat, News, QA, and Storytelling) with approximately 15 hours of audio. The dataset aims to provide a comprehensive resource for training TTS models to generate speech with varying prosody, reflecting the variation encountered in day-to-day communication.



The dataset has four main categories, with a detailed description as follows:

  • Chat: Casual conversations, including informal dialogues, interactive discussions, and crosstalk.
  • QA: Question-and-answer interactions from online shopping platforms, and queries pertaining to website construction.
  • News: News segments from national television broadcasts in China.
  • Story: Stories for children and adults, encompassing diverse storytelling styles and themes.
Please note that while the first two categories represent two-person interactions, each is recorded by a single speaker.

Dataset Stats:

The detailed dataset stats are as follows.

Scene Number of speakers Total time (hr) Number of clips
Chat 4 8.90 3349
News 2 2.21 719
QA 3 2.83 1029
Story 4 3.86 1373

Here are the audio attributes of the different scenes. The scenes vary substantially in speed, pitch, and energy, each of which serves as a strong indicator of highly variable prosody.

(Figure: distribution of speed, pitch, and energy across the four scenes.)
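As a concrete illustration of one of these attributes, frame-level energy can be measured directly from the waveform with short-time RMS; the sketch below is a plain-NumPy illustration and not the paper's analysis pipeline. Pitch would typically come from a dedicated tracker (e.g. pYIN) and speed from phoneme durations in a forced alignment, both of which are assumed rather than shown here.

```python
import numpy as np

def frame_rms_energy(waveform, frame_length=1024, hop_length=256):
    """Short-time RMS energy of a mono waveform, one value per frame."""
    n_frames = 1 + max(0, len(waveform) - frame_length) // hop_length
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_length : i * hop_length + frame_length]
        energies[i] = np.sqrt(np.mean(frame ** 2))
    return energies

# Toy signal: a quiet half followed by a loud half; the frame energies
# clearly separate the two regions, the kind of contrast that
# distinguishes, say, calm news reading from animated storytelling.
sr = 16000
t = np.arange(sr) / sr
wav = np.concatenate([0.1 * np.sin(2 * np.pi * 220 * t),
                      0.8 * np.sin(2 * np.pi * 220 * t)])
energies = frame_rms_energy(wav)
```

Comparing the distribution of such per-frame values across scenes is one simple way to quantify the prosodic variation summarized in the figure.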

Dataset demo:

MS-Voice's utility lies not only in its prosodic richness but also in the uniformity of voice timbre across different prosodic contexts. This duality enables nuanced voice synthesis, allowing TTS models to generate varied speech outputs with disentangled representations of timbre and prosody. Below, we present demo audio clips for each scene.

Scene Audio_Speaker1 Audio_Speaker2
Chat
News
QA
Story

Organization of the dataset:

The dataset is provided as a single set, without any predefined train-test split.
The directory structure is hierarchical, organized first by scene and then by speaker. Within each speaker folder, we provide the audio together with its corresponding text and pronunciation (pinyin). The following ASCII diagram depicts this structure:

├── readme.txt
└── data/
    └── qa/
        └── 客服问答一_犀牛有角/
            ├── 0.wav
            ├── 0.txt
            └── 0_pinyin.txt
            ...
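Given that layout, the dataset can be indexed with a short script like the one below. This is a sketch based only on the directory diagram above (the `data/<scene>/<speaker>/N.{wav,txt}` convention and the `_pinyin.txt` suffix); adjust the patterns if your copy of the dataset differs.

```python
from pathlib import Path

def index_ms_voice(root):
    """Walk the MS-Voice layout and return (scene, speaker, wav, text,
    pinyin) path tuples for every complete utterance triple."""
    items = []
    for wav in sorted(Path(root, "data").glob("*/*/*.wav")):
        stem = wav.stem                      # e.g. "0"
        txt = wav.with_name(f"{stem}.txt")
        pinyin = wav.with_name(f"{stem}_pinyin.txt")
        # Keep only utterances where audio, text, and pinyin all exist.
        if txt.exists() and pinyin.exists():
            scene = wav.parent.parent.name   # e.g. "qa"
            speaker = wav.parent.name        # speaker/session folder name
            items.append((scene, speaker, wav, txt, pinyin))
    return items
```

The tuple list can then feed a standard data loader, with the scene label doubling as the prosody-domain label during fine-tuning.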

Access:

The dataset will be made available after review. If you would like to obtain the full dataset, please contact the following email.

Prosody Transfer In Adaptive TTS

Ref-Spk Ref-Prosody Text SceneAdapter

(Ref-Spk)

(Story Telling)
突然,蝴蝶停在了一朵花上,波波小心翼翼地靠近,屏住呼吸,他伸出手,轻轻地抓住了蝴蝶。
朋友们鼓励波波,只要你相信自己,不放弃,你一定能抓得到蝴蝶。

(Live Commerce)
大家好,欢迎来到今天的直播间,我是你们的主播珊珊,很高兴能够在这里与大家见面。
如果你也喜欢这个小零食,记得关注我们的直播间,我们每周都会有很多美味又健康的零食分享给大家。

(News Broadcasting)
知识产权含金量明显提升,是近年来我国知识产权高质量发展的特征之一。
作为全球首个发明专利有效量超三百万件的国家,我国发明专利有效量已位居全球第一。

(Customer Service)
您好,感谢您选择我们的产品,请问有什么我可以帮助您的吗?
我遇到了一些技术问题,无法完成我需要的任务。

(Education Teaching)
在语文学习中,阅读理解是必不可少的一部分,它是对文本深度理解与感悟的关键。
阅读理解不仅需要理解和分析文本的能力,还要求具备批判性思维和解决问题的能力

(Ref-Spk)

(Story Telling)
突然,蝴蝶停在了一朵花上,波波小心翼翼地靠近,屏住呼吸,他伸出手,轻轻地抓住了蝴蝶。
朋友们鼓励波波,只要你相信自己,不放弃,你一定能抓得到蝴蝶。

(Live Commerce)
大家好,欢迎来到今天的直播间,我是你们的主播珊珊,很高兴能够在这里与大家见面。
如果你也喜欢这个小零食,记得关注我们的直播间,我们每周都会有很多美味又健康的零食分享给大家。

(News Broadcasting)
知识产权含金量明显提升,是近年来我国知识产权高质量发展的特征之一。
作为全球首个发明专利有效量超三百万件的国家,我国发明专利有效量已位居全球第一。

(Customer Service)
您好,感谢您选择我们的产品,请问有什么我可以帮助您的吗?
我遇到了一些技术问题,无法完成我需要的任务。

(Education Teaching)
在语文学习中,阅读理解是必不可少的一部分,它是对文本深度理解与感悟的关键。
阅读理解不仅需要理解和分析文本的能力,还要求具备批判性思维和解决问题的能力

Zero-Shot Style Transfer

Note: We test the model's zero-shot style-transfer ability by using the entirely unseen ESD dataset as the reference prosody.


Ref-Spk Ref-Prosody Text SceneAdapter

(Ref-Spk: Mandarin)

(ESD Dataset Unseen)
当然是了,我现在快饿死了。
得了吧,别这么胆小啦。

(ESD Dataset Unseen)
当然是了,我现在快饿死了。
得了吧,别这么胆小啦。

(Ref-Spk: Mandarin)

(ESD Dataset Unseen)
听说你要去香港看你叔叔。
这可真不像是场英超比赛。

(ESD Dataset Unseen)
听说你要去香港看你叔叔。
这可真不像是场英超比赛。

Prosody Transfer In Adaptive TTS In English And English-Chinese Mixed Text

Note: Essentially, we are testing the model's cross-lingual ability. Because the only English data seen during pre-training comes from VCTK, while both the fine-tuning style and the fine-tuning speaker are in Mandarin, the English pronunciation and style transfer are noticeably weaker.


Ref-Spk Ref-Prosody Text SceneAdapter

(Ref-Spk: Mandarin)

(Story Telling: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

(Customer Service: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

(Ref-Spk: Mandarin)

(Story Telling: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

(Customer Service: Mandarin)
Who is been repeating all that hard stuff to you?
I thank you for this mercy!
Your path now goes south.
AI for Data, Data for AI等技术的落地,将使我们在海量数据中探索更深层次的价值。

Model Comparison

Ref-Spk A³T AdaSpeech 4 SceneAdapter