MOSS-TTSD is a spoken dialogue generation model that enables expressive dialogue speech synthesis in both Chinese and English, supporting zero-shot multi-speaker voice cloning, voice event control, and long-form speech generation.
Speech is a crucial interface for both human-to-human and human-machine interaction. Natural prosody, high expressiveness, and human-like delivery are essential capabilities for strong artificial intelligence. In real-world scenarios, however, complex contexts impose different requirements on speech: properties such as prosody and style vary with the situation. Conversational speech is one of the most common scenarios, and much familiar audio content exists in dialogue form, such as podcasts, interviews, sports commentary, news reports, and e-commerce live streaming. In a conversation, the acoustic characteristics of each speaker's voice must take the entire dialogue history and the specific conversational context into account, which places new demands on the modeling capabilities of generative models. While current TTS models have made significant progress on single-sentence speech generation, they still cannot synthesize high-quality conversational speech because they lack the overall dialogue context.
To address this problem, we build a conversational speech synthesis model called MOSS-TTSD (Text to Spoken Dialogue), which generates high-quality conversational speech directly from multi-speaker dialogue text and accurately models conversational characteristics such as prosody and intonation. MOSS-TTSD is built on the Qwen3-1.7B-base model and uses discrete speech sequence modeling. It is trained on approximately one million hours of single-speaker speech and 400,000 hours of conversational speech, and supports bilingual synthesis in both Chinese and English. Thanks to a low-bitrate codec and an efficient data processing pipeline, MOSS-TTSD is trained on massive amounts of real-world data and achieves industry-leading naturalness and expressiveness. It supports dual-speaker voice cloning and long-form speech generation.
The model weights, inference code, and API interfaces of MOSS-TTSD-V0 are fully open source and support commercial use: https://github.com/OpenMOSS/MOSS-TTSD
We welcome you to try it out!
MOSS-TTSD adopts fully discrete speech generation. We train an 8-layer RVQ audio codec, XY-Tokenizer, to quantize the raw audio.
XY-Tokenizer encodes both the semantic and acoustic information of speech at a low bitrate (1 kbps), which allows large language models (LLMs) to effectively learn the audio sequences while still modeling fine-grained acoustic details.
For sequence modeling, inspired by MusicGen, we arrange the multi-layer RVQ tokens with a delay pattern and generate them autoregressively.
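For illustration, here is a minimal sketch of a MusicGen-style delay pattern applied to an (n_codebooks, n_frames) matrix of RVQ indices; the exact token layout and padding conventions used in MOSS-TTSD may differ, and the `PAD` value is a placeholder.

```python
import numpy as np

PAD = -1  # placeholder id for not-yet-available positions (illustrative only)

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps, so that at autoregressive step t the
    model emits layer 0 of frame t, layer 1 of frame t-1, and so on.

    codes: (n_codebooks, n_frames) array of RVQ indices.
    returns: (n_codebooks, n_frames + n_codebooks - 1) delayed array.
    """
    n_q, n_frames = codes.shape
    out = np.full((n_q, n_frames + n_q - 1), PAD, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k:k + n_frames] = codes[k]
    return out

def revert_delay_pattern(delayed: np.ndarray, n_frames: int) -> np.ndarray:
    """Undo the shift to recover the time-aligned (n_codebooks, n_frames) codes."""
    n_q = delayed.shape[0]
    return np.stack([delayed[k, k:k + n_frames] for k in range(n_q)])

# 8 codebooks and 4 frames become 11 autoregressive steps after delaying.
codes = np.arange(8 * 4).reshape(8, 4)
delayed = apply_delay_pattern(codes)
assert np.array_equal(revert_delay_pattern(delayed, n_frames=4), codes)
```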
To unify the modeling of speech semantics and acoustics at a low bitrate, we build XY-Tokenizer, which encodes speech with a dual-path Whisper encoder, quantizes it with an 8-layer RVQ, and is trained with two-stage multi-task learning. It achieves a 1 kbps bitrate at a 12.5 Hz frame rate.
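As a quick sanity check, the stated bitrate and frame rate are consistent with 8 codebooks of about 1,024 entries each (10 bits per index); the codebook size here is our inference from these numbers, not a stated specification.

```python
frame_rate_hz = 12.5   # frames per second
n_codebooks = 8        # 8-layer RVQ
bits_per_index = 10    # assumed: 2**10 = 1024 entries per codebook

bitrate_bps = frame_rate_hz * n_codebooks * bits_per_index
print(bitrate_bps)     # 1000.0 bits/s, i.e. ~1 kbps
```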
We scale up the codec training data, training on 100,000 hours of speech with transcriptions.
The table below compares the semantic and acoustic performance of different codecs on the LibriSpeech dataset. WER is the word error rate of an ASR probing task (semantic quality); SIM is speaker similarity, while STOI and PESQ measure reconstruction quality (acoustic quality).
| Model | BPS | Frame Rate (Hz) | Semantic | WER ↓ | SIM ↑ | STOI ↑ | PESQ-NB ↑ | PESQ-WB ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC-8 | 6k | 75 | No | 0.74 | 0.88 | 0.95 | 3.79 | 3.46 |
| SpeechTokenizer | 4k | 50 | Yes | 0.20 | 0.84 | 0.92 | 3.05 | 2.60 |
| Mimi-32 | 4.4k | 12.5 | Yes | 0.28 | 0.93 | 0.96 | 3.79 | 3.42 |
| DAC-2 | 1.5k | 75 | No | 0.98 | 0.49 | 0.83 | 1.91 | 1.51 |
| BigCodec | 1.04k | 80 | No | 0.49 | 0.84 | 0.93 | 3.26 | 2.68 |
| Mimi-8 | 1.1k | 12.5 | Yes | 0.28 | 0.73 | 0.90 | 2.79 | 2.24 |
| Baichuan | 1.075k | 12.5 | Yes | 0.10 | 0.70 | 0.88 | 2.45 | 1.93 |
| XCodec2.0 | 0.8k | 50 | Yes | 0.30 | 0.82 | 0.91 | 3.03 | 2.43 |
| XY-Tokenizer | 1k | 12.5 | Yes | 0.13 | 0.83 | 0.91 | 3.00 | 2.41 |
To better encode and reconstruct complex dialogue audio, we add another 500,000 hours of untranscribed audio for an enhancement training stage, extending the codec's ability to handle complex audio and scenarios.
Thanks to the codec's very low bitrate, our model is trained on sequences of up to 960 seconds of audio, allowing it to generate very long speech in one pass and avoiding the unnatural transitions caused by splicing shorter segments.
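To make that scale concrete, the sketch below computes the token budget implied by a 960-second window at 12.5 Hz; the autoregressive step count assumes the MusicGen-style delay pattern sketched earlier.

```python
frame_rate_hz = 12.5
n_codebooks = 8
max_audio_s = 960

frames = int(max_audio_s * frame_rate_hz)   # 12,000 frames per codebook
total_codes = frames * n_codebooks          # 96,000 RVQ indices in total
ar_steps = frames + n_codebooks - 1         # ~12,007 steps if all 8 layers are
                                            # emitted per step under a delay pattern
print(frames, total_codes, ar_steps)
```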
The performance of TTS models is closely tied to the quality and quantity of training data. To scale up high-quality TTS and TTSD data, we design an efficient data processing pipeline that accurately filters single-speaker speech and multi-speaker dialogue from massive amounts of raw audio.
For the raw audio, we first use an internal speaker diarization model to segment the speech and label the speakers. Built on a pre-trained base model, our speaker diarization model outperforms the open-source pyannote-speaker-diarization-3.1 as well as its commercial version, pyannoteAI.
All numbers are DER ↓ (diarization error rate).

| Model | AISHELL-4 | AliMeeting (channel 1) | AMI (IHM) | AMI (SDM) |
| --- | --- | --- | --- | --- |
| pyannote-speaker-diarization-3.1 | 11.7 | 24.7 | 20.5 | 24.3 |
| pyannoteAI | 11.1 | 18.3 | 17.5 | 20.0 |
| Our diarization model | 9.7 | 14.1 | 14.5 | 17.2 |
We use the DNSMOS score as our criterion for speech quality, assuming that speech with a high DNSMOS score is unlikely to contain background noise.
To ensure speech quality and reduce noise, we only retain audio segments with DNSMOS >= 2.8.
For high-quality audio segments, we directly transcribe the speech to serve as TTS training data.
Additionally, we design a set of rules to combine speaker-separated audio segments from diarization into two-speaker dialogue segments for TTSD training, which we refer to as coarse-grained dialogue segments.
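A simplified sketch of this filtering-and-merging step is shown below. The DNSMOS >= 2.8 threshold comes from the text, while the gap limit, length limit, and the exactly-two-speakers check are illustrative assumptions, not the actual rule set.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    speaker: str   # diarization speaker label
    start: float   # seconds
    end: float     # seconds
    dnsmos: float  # DNSMOS quality score for this segment

def merge_dialogue_segments(segments: List[Segment],
                            min_dnsmos: float = 2.8,
                            max_gap_s: float = 2.0,
                            max_len_s: float = 960.0) -> List[List[Segment]]:
    """Group diarized segments into coarse-grained two-speaker dialogue chunks."""
    # 1) Quality filter: drop segments that likely contain background noise.
    clean = [s for s in sorted(segments, key=lambda s: s.start)
             if s.dnsmos >= min_dnsmos]

    # 2) Merge temporally adjacent segments into candidate dialogue chunks.
    dialogues, current = [], []
    for seg in clean:
        if current and (seg.start - current[-1].end > max_gap_s
                        or seg.end - current[0].start > max_len_s):
            dialogues.append(current)
            current = []
        current.append(seg)
    if current:
        dialogues.append(current)

    # 3) Keep only chunks that involve exactly two speakers.
    return [d for d in dialogues if len({s.speaker for s in d}) == 2]
```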
Although the speaker diarization model separates speakers accurately overall, we find that it is not very sensitive to short backchannels, which leads to missed separations.
Furthermore, current ASR models cannot accurately transcribe overlapping speech in conversations.
Therefore, inspired by Parakeet, we re-annotate the coarse-grained dialogue segments at a finer granularity, producing speaker labels and transcriptions that better capture backchannels and overlapping speech.
| Model | WER ↓ | CER ↓ | WER (Norm) ↓ | CER (Norm) ↓ |
| --- | --- | --- | --- | --- |
| Seed-TTS | 2.25 | 1.12 | N/A | N/A |
| CosyVoice2 | 2.80 | 1.59 | 2.52 | 0.80 |
| SparkTTS | 1.99 | 2.12 | 1.69 | 1.44 |
| MOSS TTS-base | 1.90 | 1.56 | 1.54 | 0.82 |
We pre-train the model using 1.1 million hours of Chinese and English TTS data.
This large-scale TTS pre-training significantly improves the prosody and expressiveness of the TTS model while also strengthening its generalization capabilities.
We evaluate the pre-trained TTS model on seed-tts-eval; the results are shown in the table above.
Ultimately, we collect 100,000 hours of Chinese conversational data and 270,000 hours of English conversational data. Additionally, to improve the model's accuracy in speaker switching, we synthesize 40,000 hours of Chinese conversational data and 40,000 hours of English conversational data. To enhance the model's awareness of Chinese punctuation, we refine the transcribed texts in a portion of the data (approximately 70,000 hours) using Gemini.
During training, we use a WSD (warmup-stable-decay) scheduler and start from a pre-trained TTS checkpoint, without any specific data scheduling for the decay stage. We also find it challenging to select the best checkpoint from the validation set, so we instead choose the checkpoint with the best subjective quality in manual evaluation.
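For reference, a minimal WSD (warmup-stable-decay) learning-rate schedule is sketched below; the warmup length, peak learning rate, and decay shape used for MOSS-TTSD are not specified, so all values here are placeholders.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 3e-5) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, then a linear
    decay over the final fraction of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:
        return peak_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress
```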
We test the Chinese and English dialogue generation capabilities of MOSS-TTSD.
For Chinese dialogue generation, we compare it with the open-source MoonCast model; the demo samples are listed in the tables below.
| Topic | Prompt1 | Prompt2 | MoonCast | MOSS-TTSD-v0 |
| --- | --- | --- | --- | --- |
| Context Scaling Discussion | (audio) | (audio) | (audio) | (audio) |
| Game Discussion | (audio) | (audio) | (audio) | (audio) |
| Asteroid TTS Talk | (audio) | (audio) | (audio) | (audio) |
| Topic | Prompt1 | Prompt2 | MoonCast | MOSS-TTSD-v0 |
| --- | --- | --- | --- | --- |
| 贾玲 x 刘德华 | (audio) | (audio) | (audio) | (audio) |
| 潘长江 x 嘎子 | (audio) | (audio) | (audio) | (audio) |
| 邓紫棋 x 周杰伦 | (audio) | (audio) | (audio) | (audio) |
| Elon x Jensen | (audio) | (audio) | (audio) | (audio) |
| Trump x Obama | (audio) | (audio) | (audio) | (audio) |
Based on recent hot topics in the AI field, we generate a series of AI news podcasts. Comparing podcast generation from Doubao (a commercial model) with our open-source workflow, we find the two perform comparably across multiple dimensions: in emotional expressiveness, naturalness of tone, and overall delivery quality, our open-source model rivals the commercial solution. This showcases the significant potential of MOSS-TTSD for text-to-speech synthesis.
| Date | Doubao Podcast | MOSS-TTSD-v0 |
| --- | --- | --- |
| 2025.06.17 | (audio) | (audio) |
| 2025.06.18 | (audio) | (audio) |
| 2025.06.19 | (audio) | (audio) |
MOSS-TTSD also supports voice event control, such as coughing and laughter:

| Category | Transcription | MOSS-TTSD-v0 |
| --- | --- | --- |
| Cough | [S1]今晚你来吗? ("Are you coming tonight?") | (audio) |
| Laugh | [S1]哎你看见刚才那人的帽子没? ("Hey, did you see that person's hat just now?") | (audio) |
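The transcriptions above use [S1]/[S2] tags to mark speaker turns. A minimal helper for serializing a two-speaker script into this tagged form might look like the sketch below; consult the official repository for the exact input format MOSS-TTSD expects, including how voice events are annotated.

```python
from typing import List, Tuple

def to_tagged_dialogue(turns: List[Tuple[int, str]]) -> str:
    """Serialize (speaker_id, text) turns into a '[S1]...[S2]...' string,
    following the speaker-tag style shown in the examples above."""
    return "".join(f"[S{speaker}]{text}" for speaker, text in turns)

# Illustrative script (placeholder content, not from the demos above).
script = [(1, "Are you coming tonight?"), (2, "Yes, but I might be a bit late.")]
print(to_tagged_dialogue(script))  # [S1]Are you coming tonight?[S2]Yes, but ...
```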
TTSD Training: Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Yiwei Zhao, Jun Zhan, Dong Zhang
TTS Pretraining: Botian Jiang, Yiwei Zhao, Jin Wang, Yucheng Yuan, Xin Zhang
Foundation Model Pretraining: Xingjian Zhao, Zhe Xu, Hanfu Chen, Yang Wang, Yaozhou Jiang, Ruiming Wang, Cheng Chang
Codec: Yitian Gong, Ruifan Deng, Luozhijie Jin, Qinghui Gao, Dong Zhang
Infrastructure: Ruixiao Li, Mingshu Chen, Cheng Chang
Data Pipeline: Ke Chen, Wenbo Zhang, Wenxuan Wang
Data Collection: Qinghui Gao, Zhengyuan Lin, Donghua Yu, Yuqian Zhang, Zhaoye Fei
Additional Contribution: Qian Tu, Chenchen Yang, Liwei Fan, Kexin Huang
Supervision: Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu