MOSS-TTSD: Text to Spoken Dialogue Generation

MOSS-TTSD is a spoken dialogue generation model that enables expressive dialogue speech synthesis in both Chinese and English, supporting zero-shot multi-speaker voice cloning, voice event control, and long-form speech generation.

Speech serves as a crucial interface for both human-to-human and human-machine interaction. Natural prosody, high expressiveness, and human-like delivery are essential capabilities for strong artificial intelligence. However, real-world scenarios impose varied requirements on speech: properties such as prosody and style change with the context. Conversation is one of the most common real-world scenarios, and much familiar audio content exists in dialogue form, such as podcasts, interviews, sports commentary, news reports, and e-commerce live streams. In a conversation, the acoustic characteristics of each speaker's voice must reference the entire dialogue history and the specific conversational context, which places new demands on the modeling capabilities of generative models. While current TTS models have made significant progress on single-sentence speech generation, they still cannot synthesize high-quality conversational speech because they lack the overall dialogue context.

To address this problem, we build MOSS-TTSD (Text to Spoken Dialogue), a conversational speech synthesis model that generates high-quality conversational speech directly from multi-speaker dialogue text, accurately modeling characteristics such as prosody and intonation in conversation. MOSS-TTSD is based on the Qwen3-1.7B-base model and uses discrete speech sequence modeling. It is trained on approximately one million hours of single-speaker speech data and 400,000 hours of conversational speech data, and supports bilingual speech synthesis in both Chinese and English. Thanks to a low-bitrate codec and an efficient data processing pipeline, MOSS-TTSD is trained on massive amounts of real-world data, achieving industry-leading naturalness and expressiveness. It supports dual-speaker voice cloning and long-form speech generation.

The model weights, inference code, and API interfaces for MOSS-TTSD-V0 are all open source and support commercial use: https://github.com/OpenMOSS/MOSS-TTSD

We welcome you to try it out!

Model Overview

Figure 1. Model overview: The model is trained based on Qwen3-1.7B-base, uses an 8-layer RVQ codebook for speech discretization, employs autoregressive modeling with a delay pattern for speech token generation, and finally uses the tokenizer's decoder to reconstruct speech from the speech tokens.

MOSS-TTSD uses fully discrete speech generation. We train an 8-layer RVQ audio codec, XY-Tokenizer, to quantize the raw audio. XY-Tokenizer encodes both the semantic and the acoustic information of speech at a low bitrate (1 kbps), which allows large language models (LLMs) to learn audio sequences effectively and model fine acoustic details. For sequence modeling, inspired by MusicGen and VOICECRAFT, we use autoregressive modeling with a multi-head delay pattern for speech token generation.
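As a hedged illustration of the multi-head delay pattern (the exact layout used in MOSS-TTSD may differ), the sketch below offsets codebook k by k steps, so that at every decoding step the model predicts one token per codebook while coarser codebooks stay one step ahead of finer ones:

```python
# Minimal sketch of a MusicGen-style delay pattern for 8 RVQ codebooks.
# Assumptions (not taken from the MOSS-TTSD release): a PAD token fills the
# offset positions, and codebook k is delayed by exactly k steps.
import numpy as np

PAD = -1  # placeholder id for positions that are not yet valid

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """codes: (num_codebooks, num_frames) -> (num_codebooks, num_frames + num_codebooks - 1)."""
    K, T = codes.shape
    delayed = np.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        delayed[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return delayed

def undo_delay_pattern(delayed: np.ndarray, num_frames: int) -> np.ndarray:
    """Invert the shift to recover the (num_codebooks, num_frames) grid."""
    K = delayed.shape[0]
    return np.stack([delayed[k, k:k + num_frames] for k in range(K)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 1024, size=(8, 5))    # 8 codebooks, 5 frames
    delayed = apply_delay_pattern(codes)          # (8, 12) with PAD in the corners
    assert np.array_equal(undo_delay_pattern(delayed, 5), codes)
```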

Speech Discretization: XY-Tokenizer

To jointly model the semantic and acoustic information of speech at a low bitrate, we build XY-Tokenizer, which uses a dual-path Whisper encoder to encode speech, 8-layer RVQ for quantization, and two-stage multi-task learning. It achieves a 1 kbps bitrate at a 12.5 Hz frame rate.

Figure 2. XY-Tokenizer is trained with two-stage multi-task learning. The first stage (upper part) trains on an ASR task and a reconstruction task, allowing the encoder to encode semantic information while retaining coarse acoustic information. The second stage (lower part) freezes the encoder and quantization layers and trains only the decoder; through reconstruction loss and GAN loss, it uses generative modeling capacity to fill in fine-grained acoustic details.
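As a quick sanity check of the 1 kbps figure, assuming 1024-entry codebooks (10 bits per code, which is our assumption rather than a stated detail):

```python
# Rough bitrate check for XY-Tokenizer's configuration.
# Assumption (not stated above): each RVQ codebook has 1024 entries = 10 bits/code.
frame_rate_hz = 12.5   # frames per second
num_codebooks = 8      # RVQ layers
bits_per_code = 10     # log2(1024), assumed codebook size

bitrate_bps = frame_rate_hz * num_codebooks * bits_per_code
print(bitrate_bps)     # 1000.0 bits/s, i.e. ~1 kbps
```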

We scale up the codec training data, training on 100,000 hours of speech with transcription text. The table below compares the semantic and acoustic performance of different codecs on the LibriSpeech dataset. WER is the word error rate on an ASR probing task; lower WER indicates better alignment with the original text. Bold marks SOTA or near-SOTA performance within the low-bitrate codec group.

Model BPS Frame Rate (Hz) Semantic WER ↓ SIM ↑ STOI ↑ PESQ-NB ↑ PESQ-WB ↑
DAC-8 6k 75 No 0.74 0.88 0.95 3.79 3.46
SpeechTokenizer 4k 50 Yes 0.20 0.84 0.92 3.05 2.60
Mimi-32 4.4k 12.5 Yes 0.28 0.93 0.96 3.79 3.42
DAC-2 1.5k 75 No 0.98 0.49 0.83 1.91 1.51
BigCodec 1.04k 80 No 0.49 0.84 0.93 3.26 2.68
Mimi-8 1.1k 12.5 Yes 0.28 0.73 0.90 2.79 2.24
Baichuan 1.075k 12.5 Yes 0.10 0.70 0.88 2.45 1.93
XCodec2.0 0.8k 50 Yes 0.30 0.82 0.91 3.03 2.43
XY-Tokenizer 1k 12.5 Yes 0.13 0.83 0.91 3.00 2.41
Among codecs at around 1 kbps and a 12.5 Hz frame rate, XY-Tokenizer achieves the best combined semantic and acoustic performance, making it the strongest unified codec in this group.

To better encode and reconstruct complex dialogue audio, we add a further 500,000 hours of untranscribed audio for continued training, extending the codec's ability to handle complex audio and diverse scenarios.

Thanks to the codec's very low bitrate, our model's training length reaches up to 960 seconds of audio, allowing it to generate very long speech in a single pass and avoiding the unnatural transitions that come from stitching together separately generated segments.
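For intuition, the arithmetic below shows why a 960-second window stays tractable at this frame rate; the extra delay-pattern offset is our assumption based on the modeling setup described earlier:

```python
# Sequence-length arithmetic for a 960 s training window at a 12.5 Hz frame rate.
frame_rate_hz = 12.5
max_audio_seconds = 960
num_codebooks = 8

frames = int(frame_rate_hz * max_audio_seconds)   # 12000 frames per codebook
decode_steps = frames + num_codebooks - 1         # ~12007 steps with the assumed delay pattern
print(frames, decode_steps)
```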

Data Engineering

The performance of TTS models is closely tied to the quality and quantity of training data. To scale up high-quality TTS and TTSD data, we design an efficient data processing pipeline that accurately screens single-speaker speech and multi-speaker dialogue speech from massive amounts of raw audio.

Figure 3. Data pipeline overview: We separate single-speaker speech and multi-speaker dialogue speech from massive raw audio and annotate them with internal tool models.

For the raw audio, we first use an internal speaker diarization model to segment the speech and annotate the speakers. Built on a pre-trained base model, our speaker diarization model outperforms the open-source pyannote-speaker-diarization-3.1 and its commercial version, pyannoteAI.

Model AISHELL-4 AliMeeting(channel 1) AMI(IHM) AMI(SDM)
DER ↓
pyannote-speaker-diarization-3.1 11.7 24.7 20.5 24.3
pyannoteAI 11.1 18.3 17.5 20.0
Our Diarization Model 9.7 14.1 14.5 17.2
Speaker diarization DER (Diarization Error Rate) results on different datasets (lower is better); our model achieves the best performance on all four test sets.
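For reference, DER measures the fraction of reference speech time that is misattributed: missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. A minimal sketch of that bookkeeping (not the evaluation code behind the table above):

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech

# Example: 1.2 s missed, 0.8 s false alarm, 0.5 s confusion over 100 s of speech -> 2.5% DER.
print(100 * diarization_error_rate(1.2, 0.8, 0.5, 100.0))
```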

We use the DNSMOS score as the standard for evaluating speech quality, assuming that speech with a high DNSMOS score is unlikely to contain background noise. To ensure speech quality and reduce noise, we only retain audio segments with DNSMOS >= 2.8. For high-quality audio segments, we directly transcribe the speech to serve as TTS training data. Additionally, we design a set of rules to combine the speaker-separated audio segments from diarization into two-speaker dialogue segments for TTSD training, which we refer to as coarse-grained dialogue segments. Although the speaker diarization model separates speakers accurately, we find that it is not particularly sensitive to short backchannels, resulting in missed separations. Furthermore, current ASR models cannot accurately transcribe overlapping speech in conversations. Therefore, inspired by Parakeet, we train a Chinese version of the Whisper-d model for fine-grained speaker annotation and text transcription of Chinese data. For English data, we directly use the open-source Whisper-d from Parakeet. Finally, we use the coarse-grained labels from the speaker diarization model and the fine-grained labels from Whisper-d to compose short dialogue segments into longer dialogue segments.
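A rough sketch of this filtering-and-assembly step is shown below; the segment fields, the DNSMOS scorer, and the two-speaker/gap merge rule are simplified stand-ins for our internal tooling rather than the actual pipeline code.

```python
# Hedged sketch of the segment filtering / coarse dialogue assembly step.
from dataclasses import dataclass

DNSMOS_THRESHOLD = 2.8    # keep segments at or above this quality score
MAX_GAP_SECONDS = 2.0     # assumed maximum silence allowed when merging

@dataclass
class Segment:
    start: float
    end: float
    speaker: str
    dnsmos: float

def keep_clean(segments: list[Segment]) -> list[Segment]:
    """Drop segments whose DNSMOS score suggests background noise."""
    return [s for s in segments if s.dnsmos >= DNSMOS_THRESHOLD]

def merge_into_dialogues(segments: list[Segment]) -> list[list[Segment]]:
    """Group consecutive clean segments into coarse two-speaker dialogue chunks."""
    dialogues, current = [], []
    for seg in sorted(segments, key=lambda s: s.start):
        too_far = current and seg.start - current[-1].end > MAX_GAP_SECONDS
        too_many = current and len({s.speaker for s in current} | {seg.speaker}) > 2
        if too_far or too_many:
            dialogues.append(current)
            current = []
        current.append(seg)
    if current:
        dialogues.append(current)
    # keep only chunks that actually contain two distinct speakers
    return [d for d in dialogues if len({s.speaker for s in d}) == 2]
```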

TTS Pre-training

Model WER ↓ CER ↓ WER(Norm) ↓ CER(Norm) ↓
Seed-TTS 2.25 1.12 N/A N/A
Cosyvoice2 2.80 1.59 2.52 0.80
SparkTTS 1.99 2.12 1.69 1.44
MOSS TTS-base 1.90 1.56 1.54 0.82
Comparison of Word Error Rate (WER) and Character Error Rate (CER) on the Seed-tts-eval test set for TTS pre-training models (lower is better). Boldfaced results indicate the best and second-best performance. WER(Norm) means we applied rule-based corrections to the ASR outputs targeting synonym recognition errors, reducing misjudgments caused by ASR model inaccuracies. CER(Norm) means we converted the Chinese text to Pinyin before computing the character error rate, effectively a Phonetic Error Rate (PER), which we consider a more reasonable metric. The results for SparkTTS and Cosyvoice2 are from our local re-evaluation using the official inference code.

We pre-train the model using 1.1 million hours of Chinese and English TTS data. This large-scale TTS pre-training significantly enhances the prosody and expressiveness of the TTS model while enhancing its generalization capabilities. We evaluate the performance of the pre-trained TTS model using Seed-tts-eval, achieving results comparable to the current top-tier closed-source model, Seed-TTS. The model after TTS pre-training has developed strong speech generation capabilities and zero-shot voice cloning abilities.
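As a hedged illustration of the CER(Norm)/PER idea described in the table caption above, the sketch below converts both texts to toneless Pinyin with the open-source pypinyin package before scoring with jiwer; our internal evaluation script may differ in detail.

```python
# Sketch of a Pinyin-based "phonetic error rate": not the exact evaluation script,
# but the same idea as CER(Norm) described above.
from pypinyin import lazy_pinyin   # pip install pypinyin
import jiwer                        # pip install jiwer

def pinyin_error_rate(reference: str, hypothesis: str) -> float:
    """Convert both texts to toneless Pinyin syllables, then score over syllables."""
    ref = " ".join(lazy_pinyin(reference))
    hyp = " ".join(lazy_pinyin(hypothesis))
    return jiwer.wer(ref, hyp)

# A homophone substitution ("他" vs "她") no longer counts as an error:
print(pinyin_error_rate("他在说话", "她在说话"))  # 0.0
```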

TTSD Post-training

Ultimately, we collect 100,000 hours of Chinese conversational data and 270,000 hours of English conversational data. Additionally, to improve the model's accuracy in speaker switching, we synthesize 40,000 hours of Chinese conversational data and 40,000 hours of English conversational data. To enhance the model's awareness of Chinese punctuation, we refine the transcribed texts in a portion of the data (approximately 70,000 hours) using Gemini.

During training, we continue from a pre-trained TTS checkpoint using a WSD (warmup-stable-decay) learning-rate scheduler, without specific data scheduling for the decay stage. Additionally, we find it challenging to select the best-performing checkpoint using the validation set, so we instead choose the checkpoint with the best subjective performance in manual evaluation.
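For readers unfamiliar with it, a WSD schedule warms the learning rate up, holds it constant for most of training, and decays it at the end. A minimal sketch follows, with phase lengths and the linear decay shape chosen purely for illustration (not our actual training configuration):

```python
# Hedged sketch of a warmup-stable-decay (WSD) learning-rate schedule.
def wsd_lr(step: int, peak_lr: float = 3e-4, warmup: int = 1_000,
           stable: int = 50_000, decay: int = 5_000, min_lr: float = 3e-5) -> float:
    if step < warmup:                          # linear warmup
        return peak_lr * step / max(warmup, 1)
    if step < warmup + stable:                 # constant (stable) phase
        return peak_lr
    if step < warmup + stable + decay:         # decay phase
        progress = (step - warmup - stable) / decay
        return peak_lr + (min_lr - peak_lr) * progress
    return min_lr                              # floor after decay
```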

Demo

We test the Chinese and English dialogue generation capabilities of MOSS-TTSD. For Chinese dialogue generation, we compare it with the open-source MoonCast and the closed-source Doubao Podcast TTS. Compared to MoonCast, MOSS-TTSD demonstrates more natural prosody, stronger expressiveness, and more stable generation. Furthermore, MOSS-TTSD supports multiple voice cloning modes, including uploading an entire dialogue segment or uploading audio from individual speakers. Compared to Doubao Podcast TTS, MOSS-TTSD offers zero-shot voice cloning with comparable prosody and expressiveness, and provides greater flexibility in customizing the text input.

Dialogue Generation

Topic Prompt1 Prompt2 MoonCast MOSS-TTSD-v0
Context Scaling Discussion
Game Discussion
Asteroid TTS Talk

Voice Cloning

Topic Prompt1 Prompt2 MoonCast MOSS-TTSD-v0
贾玲 x 刘德华
潘长江 x 嘎子
邓紫棋 x 周杰伦
Elon x Jensen
Trump x Obama

AI Podcast Generation Comparison

Based on recent hot topics in the AI field, we generate a series of AI information podcasts. By comparing podcast generation from Doubao (a commercial model) with our open-source workflow, we find that the two perform comparably across multiple dimensions. Whether in terms of emotional expressiveness, naturalness of tone, or overall delivery quality, our open-source model demonstrates performance levels that rival the commercial solution. This showcases the significant potential of MOSS-TTSD in the field of text-to-speech synthesis.

Date Doubao Podcast MOSS-TTSD-v0

2025.06.17

2025.06.18

2025.06.19

Vocal Event

Category Transcription MOSS-TTSD-v0

Cough

[S1]今晚你来吗?
[S2](咳)估计去不了,人有点不舒服。
[S1]啊这样啊,那你好好休息。
[S2]没事儿,你们玩开心点。
(English translation: S1: Are you coming tonight? S2: (cough) Probably can't make it, I'm feeling a bit unwell. S1: Oh, I see. Get some good rest then. S2: No worries, you all have fun.)

Laugh

[S1]哎你看见刚才那人的帽子没?
[S2](笑)看见了!上头居然有个螺旋桨!
[S1]我真是服了,怎么会有人戴那种帽子。
(English translation: S1: Hey, did you see that person's hat just now? S2: (laughs) I did! It actually had a propeller on top! S1: Unbelievable, who would wear a hat like that.)

Contributors

TTSD Training: Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Yiwei Zhao, Jun Zhan, Dong Zhang
TTS Pretraining: Botian Jiang, Yiwei Zhao, Jin Wang, Yucheng Yuan, Xin Zhang
Foundation Model Pretraining: Xingjian Zhao, Zhe Xu, Hanfu Chen, Yang Wang, Yaozhou Jiang, Ruiming Wang, Cheng Chang
Codec: Yitian Gong, Ruifan Deng, Luozhijie Jin, Qinghui Gao, Dong Zhang
Infrastructure: Ruixiao Li, Mingshu Chen, Cheng Chang
Data Pipeline: Ke Chen, Wenbo Zhang, Wenxuan Wang
Data Collection: Qinghui Gao, Zhengyuan Lin, Donghua Yu, Yuqian Zhang, Zhaoye Fei
Additional Contribution: Qian Tu, Chenchen Yang, Liwei Fan, Kexin Huang
Supervision: Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu