NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers (Submitted to ICASSP 2025)
This page and the demos below are intended solely for research demonstration.
Authors
- Nohil Park pnoil2588@snu.ac.kr
- Heeseung Kim gmltmd789@snu.ac.kr
- Che Hyun Lee saga1214@snu.ac.kr
- Jooyoung Choi jy_choi@snu.ac.kr
- Jiheum Yeom quilava1234@snu.ac.kr
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
Abstract
We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.
LibriSpeech Dataset
For the model comparison with the baselines, all samples were resampled to 16 kHz and normalized to -27 dB to keep the comparison fair. The remaining experiments involve only our model, so their samples were kept at the original sampling rate of 22 kHz.
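As a rough illustration of the level normalization mentioned above, the sketch below scales a waveform so that its RMS level is -27 dB relative to full scale. The page does not say which level measure was used, so the RMS interpretation, the function names `rms_db` and `normalize_to_db`, and the test tone are all assumptions made for this example. Resampling is not shown.

```python
import numpy as np

def rms_db(wave):
    """RMS level in dB relative to full scale (1.0)."""
    rms = np.sqrt(np.mean(np.square(wave)))
    return 20.0 * np.log10(rms)

def normalize_to_db(wave, target_db=-27.0, eps=1e-12):
    """Scale `wave` so that its RMS level equals `target_db` (assumed measure)."""
    rms = np.sqrt(np.mean(np.square(wave)))
    gain = 10.0 ** (target_db / 20.0) / max(rms, eps)
    return wave * gain

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
tone = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
normalized = normalize_to_db(tone)
```

Applying the same target level to every system's output removes loudness as a factor when listening to the samples side by side.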
Model Comparison
Transcript: Why fades the lotus of the water?
Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice |
---|---|---|---|---|---|---|
Transcript: For, like as not, they must have thought him a prince when they saw his fine cap.
Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice |
---|---|---|---|---|---|---|
Transcript: But the general distinction is not on that account to be overlooked.
Reference | GT | NanoVoice | VoiceTailor | UnitSpeech | XTTS v2 | CosyVoice |
---|---|---|---|---|---|---|
Ablation Studies
Adapter Sharing
Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.
Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$ |
---|---|---|---|---|
Transcript: Nine thousand years have elapsed since she founded yours, and eight thousand since she founded ours as our annals record.
Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$ |
---|---|---|---|---|
Transcript: Federal judges and United States attorneys in Utah, who were not Mormons nor lovers of Mormonism, refused to entertain complaints, or prosecute cases under the law because of its manifest injustice and inadequacy.
Reference | Share None | Share $B$ (default) | Share $A$ | Share $B$, $A$ |
---|---|---|---|---|
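To make the four sharing configurations above concrete, the following sketch counts adapter parameters for a single low-rank adapted layer when $B$, $A$, both, or neither are shared across speakers. The layer dimensions and rank are hypothetical (the page does not state them), the helper `adapter_params` is introduced only for this example, and the trainable scale matrix is left out of the count.

```python
def adapter_params(n_speakers, d_out, d_in, rank, share_b=False, share_a=False):
    """Count low-rank adapter parameters for one layer.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    A shared matrix is stored once; an unshared one is stored per speaker.
    """
    b_size = d_out * rank
    a_size = rank * d_in
    n_b = 1 if share_b else n_speakers
    n_a = 1 if share_a else n_speakers
    return n_b * b_size + n_a * a_size

# Hypothetical sizes, chosen only for illustration.
configs = {
    "Share None": dict(share_b=False, share_a=False),
    "Share B": dict(share_b=True, share_a=False),
    "Share A": dict(share_b=False, share_a=True),
    "Share B, A": dict(share_b=True, share_a=True),
}
for name, flags in configs.items():
    total = adapter_params(n_speakers=40, d_out=256, d_in=256, rank=16, **flags)
    print(f"{name}: {total} parameters")
```

Sharing one of the two matrices removes roughly half of the per-speaker adapter parameters once the number of speakers is large, while sharing both leaves no speaker-specific low-rank parameters at all.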
Trainable Scale Matrix
- NanoVoice
- w/o Normalization: Shared matrix $B$ with batched $A'$, multiplied by the scale matrix $m'$ but without normalizing by $\Vert W_0+\alpha\cdot BA'\Vert_c$.
- w/o Scale Matrix: Shared matrix $B$ with batched $A'$ only.
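The three variants listed above can be sketched in NumPy as follows. This is one possible reading of the descriptions, not the authors' implementation: it assumes that $\Vert\cdot\Vert_c$ denotes a column-wise L2 norm, that the scale matrix $m'$ holds one value per column and per speaker, and that the scale is applied to the full weight $W_0+\alpha\cdot BA'$. The function name `adapted_weights` and all shapes are hypothetical.

```python
import numpy as np

def adapted_weights(W0, B, A_batch, m_batch=None, alpha=1.0, normalize=True):
    """Return per-speaker adapted weights of shape (n_speakers, d_out, d_in).

    W0:      (d_out, d_in)   frozen pretrained weight
    B:       (d_out, r)      low-rank matrix shared by all speakers
    A_batch: (n, r, d_in)    one low-rank matrix A' per speaker
    m_batch: (n, 1, d_in)    one scale per column and speaker (assumed shape)
    """
    # W0 + alpha * B A' for every speaker in the batch at once.
    V = W0[None] + alpha * np.einsum("or,nri->noi", B, A_batch)
    if m_batch is None:
        return V                      # "w/o Scale Matrix"
    if not normalize:
        return m_batch * V            # "w/o Normalization"
    # Column-wise L2 norm of each speaker's weight (assumed meaning of ||.||_c).
    col_norm = np.linalg.norm(V, axis=1, keepdims=True)
    return m_batch * V / col_norm     # full variant with scale and normalization

# Hypothetical sizes: 5 speakers, a 4x3 layer, rank 2.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 2))
A_batch = rng.normal(size=(5, 2, 3))
m_batch = np.ones((5, 1, 3))
W_full = adapted_weights(W0, B, A_batch, m_batch)
```

Under this reading, the normalized variant separates the direction of each column (set by $W_0$, $B$, and $A'$) from its magnitude (set by $m'$), so each speaker can rescale columns even though $B$ is shared.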
Transcript: a feeling of freedom and I was awake. Where?
Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix |
---|---|---|---|
Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.
Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix |
---|---|---|---|
Transcript: What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.
Reference | NanoVoice (default) | w/o Normalization | w/o Scale Matrix |
---|---|---|---|
Analysis
Number of Speakers
Transcript: John Wesley Combash, Jacob Taylor, and Thomas Edward Skinner.
Reference | Batch Size = 1 | Batch Size = 5 | Batch Size = 20 | Batch Size = 40 |
---|---|---|---|---|
Transcript: Not all the Galatians had become perverted.
Reference | Batch Size = 1 | Batch Size = 5 | Batch Size = 20 | Batch Size = 40 |
---|---|---|---|---|
Role of Shared Matrix $B$
Transcript: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.
Reference | Mixed-Gender | Same-Gender |
---|---|---|
Transcript: There is the slang of the affected lady as well as of the PRECIEUSES.
Reference | Mixed-Gender | Same-Gender |
---|---|---|