We introduce MS-Voice, a high-quality, monolingual Mandarin multi-domain speech dataset. The recorded texts are highly correlated with their scenes, and the audio is recorded by professional voice actors against a clean, quiet background, so the dataset captures the nuances of expressive speech across multiple scenarios. It covers four distinct scenes (Chat, News, QA, and Storytelling) with approximately 15 hours of audio. This dataset aims to provide a comprehensive resource for training TTS models to generate speech with varying prosody, reflecting the variation encountered in day-to-day communication.
The dataset has four main categories, described as follows:
- Chat: Casual conversations, including informal dialogues, interactive discussions, and crosstalk.
- QA: Question-and-answer interactions from online shopping platforms and queries pertaining to website construction.
- News: News segments from national television broadcasts in China.
- Story: Stories for children and adults, encompassing diverse storytelling styles and themes.
Dataset Stats:
The detailed dataset stats are as follows.
Scene | Number of speakers | Total time (hr) | Number of clips |
---|---|---|---|
Chat | 4 | 8.90 | 3349 |
News | 2 | 2.21 | 719 |
QA | 3 | 2.83 | 1029 |
Story | 4 | 3.86 | 1373 |
The audio attributes of the different scenes are shown below. The scenes vary considerably in speed, pitch, and energy, all of which serve as strong indicators of highly variable prosody.
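As a rough illustration of how these attributes can be measured, the sketch below computes per-clip speed, pitch, and energy with librosa; the function name and the characters-per-second speed proxy are our own assumptions, not part of any official dataset tooling.

```python
import numpy as np
import librosa

def clip_attributes(wav_path: str, transcript: str) -> dict:
    """Rough per-clip prosody statistics: speaking speed, mean pitch, mean energy."""
    y, sr = librosa.load(wav_path, sr=None)        # keep the native sample rate
    duration = librosa.get_duration(y=y, sr=sr)

    # Speed: Mandarin characters per second as a simple proxy for speaking rate.
    speed = len(transcript) / duration

    # Pitch: mean F0 over voiced frames, estimated with pYIN.
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    mean_pitch = float(np.nanmean(f0[voiced_flag])) if voiced_flag.any() else 0.0

    # Energy: mean root-mean-square amplitude over short frames.
    mean_energy = float(librosa.feature.rms(y=y).mean())

    return {"speed_char_per_s": speed, "pitch_hz": mean_pitch, "energy_rms": mean_energy}
```

Averaging these per-clip values within each scene gives the kind of scene-level comparison described above.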
Dataset demo:
MS-Voice's utility lies not only in its prosodic richness but also in the uniformity of voice timbre across different prosodic contexts. This duality enables nuanced voice synthesis, allowing TTS models to generate varied speech outputs with disentangled representations of timbre and prosody. Below, we show demo audio clips from the different scenes.
Scene | Audio_Speaker1 | Audio_Speaker2 |
---|---|---|
Chat | (audio sample) | (audio sample) |
News | (audio sample) | (audio sample) |
QA | (audio sample) | (audio sample) |
Story | (audio sample) | (audio sample) |
Organization of the dataset:
The dataset is provided as a single set, without any predefined train-test split.
The directory structure is organized hierarchically, first by scene and then by speaker. Within each speaker folder, we provide the audio along with its corresponding text and pronunciation (pinyin). The following ASCII diagram depicts this structure:
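(In this sketch, the directory and file names are illustrative assumptions, not necessarily the exact names in the released dataset.)

```
MS-Voice/
├── Chat/
│   ├── Speaker1/
│   │   ├── 000001.wav      # audio clip
│   │   ├── 000001.txt      # transcript
│   │   └── 000001.pinyin   # pronunciation (pinyin)
│   └── Speaker2/
│       └── ...
├── News/
├── QA/
└── Story/
```

Under that assumed layout, a minimal sketch for enumerating the clips and building your own train/test split (since none is predefined) could look like this:

```python
import random
from pathlib import Path

ROOT = Path("MS-Voice")  # assumed dataset root; adjust to the actual path

# Collect one record per clip, pairing each WAV with its text and pinyin files.
clips = []
for wav in sorted(ROOT.glob("*/*/*.wav")):
    scene, speaker = wav.parts[-3], wav.parts[-2]
    clips.append({
        "scene": scene,
        "speaker": speaker,
        "wav": wav,
        "text": wav.with_suffix(".txt"),
        "pinyin": wav.with_suffix(".pinyin"),
    })

# No official split is provided, so shuffle reproducibly and hold out 5% for testing.
random.seed(0)
random.shuffle(clips)
n_test = max(1, len(clips) // 20)
test_clips, train_clips = clips[:n_test], clips[n_test:]
```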
Access:
The dataset will be made available after review. To obtain the full dataset, please contact us at the following email address.