NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers (Submitted to ICASSP 2025)

This page and the demos below are intended solely for research demonstration.

Authors

Abstract

We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.

LibriSpeech Dataset

For the model comparison with the baselines, all samples were resampled to 16 kHz and normalized to -27 dB to ensure a fair comparison. The remaining experiments involve only our model, so those samples were kept at the original sampling rate of 22 kHz.
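The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for these demos: it assumes that "-27 dB" means RMS level relative to full scale, and it uses a crude linear-interpolation resampler with no anti-aliasing filter. The function names `rms_dbfs`, `normalize_to_db`, and `resample_linear` are hypothetical.

```python
import math

def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")

def normalize_to_db(samples, target_db=-27.0):
    """Scale samples so their RMS level equals target_db (assumed to be RMS dBFS)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # pure silence: nothing to scale
    gain = 10.0 ** ((target_db - current) / 20.0)
    return [s * gain for s in samples]

def resample_linear(samples, src_rate, dst_rate):
    """Crude resampler: linear interpolation only, no anti-aliasing filter."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

# One second of a 220 Hz tone at a nominal 22 kHz rate (synthetic test signal).
tone = [0.5 * math.sin(2 * math.pi * 220 * t / 22000) for t in range(22000)]
prepared = normalize_to_db(resample_linear(tone, 22000, 16000), target_db=-27.0)
```

A real pipeline would normally use a band-limited resampler from an audio library and might measure loudness differently, so the sketch only shows the order of operations: resample first, then match levels.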

Model Comparison

Transcript: Why fades the lotus of the water?

Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice

Transcript: For, like as not, they must have thought him a prince when they saw his fine cap.

Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice

Transcript: But the general distinction is not on that account to be overlooked.

Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice

Ablation Studies

Adapter Sharing

Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.

Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$

Transcript: Nine thousand years have elapsed since she founded yours, and eight thousand since she founded ours as our annals record.

Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$

Transcript: Federal judges and United States attorneys in Utah, who were not Mormons nor lovers of Mormonism, refused to entertain complaints, or prosecute cases under the law because of its manifest injustice and inadequacy.

Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$
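To make the trade-off between these sharing variants concrete, the following sketch counts adapter parameters for each one. It rests on assumptions that this page does not state: that each adapter is a LoRA-style low-rank pair with $A$ of shape rank × d_in and $B$ of shape d_out × rank, that a single adapted layer is considered, and that the trainable scale matrix is ignored. The dimensions (256, 256, rank 16) are purely illustrative and are not the model's actual sizes; `adapter_params` is a hypothetical helper.

```python
def adapter_params(num_speakers, d_in, d_out, rank, share_a=False, share_b=False):
    """Count adapter parameters for one layer under an assumed LoRA-style layout."""
    a_size = rank * d_in    # assumed shape of A: rank x d_in
    b_size = d_out * rank   # assumed shape of B: d_out x rank
    # A shared matrix is stored once; an unshared one is stored per speaker.
    num_a = 1 if share_a else num_speakers
    num_b = 1 if share_b else num_speakers
    return num_a * a_size + num_b * b_size

# Illustrative sizes only (hypothetical): 40 speakers, 256-dim layer, rank 16.
sizes = dict(num_speakers=40, d_in=256, d_out=256, rank=16)
variants = {
    "Share None": adapter_params(**sizes),
    "Share B": adapter_params(**sizes, share_b=True),
    "Share A": adapter_params(**sizes, share_a=True),
    "Share B, A": adapter_params(**sizes, share_a=True, share_b=True),
}

for name, count in variants.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions, sharing one of the two matrices roughly halves the adapter parameters when the number of speakers is large, while sharing both leaves no speaker-specific parameters in the pair at all.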

Trainable Scale Matrix

Transcript: a feeling of freedom and I was awake. Where?

Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix

Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.

Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix

Transcript: What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.

Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix

Analysis

Number of Speakers

Transcript: John Wesley Combash, Jacob Taylor, and Thomas Edward Skinner.

Reference | Batch Size = 1 | Batch Size = 5 | Batch Size = 20 | Batch Size = 40

Transcript: Not all the Galatians had become perverted.

Reference | Batch Size = 1 | Batch Size = 5 | Batch Size = 20 | Batch Size = 40

Role of Shared Matrix $B$

Transcript: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.

Reference | Mixed-Gender | Same-Gender

Transcript: There is the slang of the affected lady as well as of the PRECIEUSES.

Reference | Mixed-Gender | Same-Gender