NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers (ICASSP 2025, Oral Presentation)

This page and the demos below are intended solely for research demonstration.

Authors

Nohil Park pnoil2588@snu.ac.kr
Heeseung Kim gmltmd789@snu.ac.kr
Che Hyun Lee saga1214@snu.ac.kr
Jooyoung Choi jy_choi@snu.ac.kr
Jiheum Yeom quilava1234@snu.ac.kr
Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr

Abstract

We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.
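
The effect of parameter sharing on adapter size can be illustrated with a short count. This is a minimal sketch, assuming LoRA-style low-rank factors `B` and `A`, sharing of `B` only, and one scale entry per weight column for each speaker. The function names and all sizes are hypothetical and are not taken from the paper, so the printed figure is not the paper's reported count.

```python
def separate_adapter_params(num_speakers, d_out, d_in, rank):
    """Each speaker owns its own B (d_out x rank) and A (rank x d_in)."""
    return num_speakers * (d_out * rank + rank * d_in)


def shared_b_adapter_params(num_speakers, d_out, d_in, rank):
    """One shared B, plus a per-speaker A (rank x d_in) and a per-speaker
    scale vector (1 x d_in). The scale shape is an assumption."""
    return d_out * rank + num_speakers * (rank * d_in + d_in)


# Illustrative sizes only; none of these values come from the paper.
num_speakers, d_out, d_in, rank = 40, 256, 256, 16

separate = separate_adapter_params(num_speakers, d_out, d_in, rank)
shared = shared_b_adapter_params(num_speakers, d_out, d_in, rank)
reduction = 1.0 - shared / separate

print(f"separate={separate}, shared={shared}, reduction={reduction:.1%}")
```

With many speakers the shared `B` is amortized, so the saving approaches the fraction of parameters that `B` contributed, minus the small overhead of the per-speaker scale entries.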

LibriSpeech Dataset

For the model comparison with the baselines, all samples were resampled to 16 kHz and normalized to -27 dB to ensure fairness. The remaining experiments use only our model, so their samples were kept at the original sampling rate of 22 kHz.
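
The level normalization step can be sketched as follows. The page does not state how the -27 dB level is measured, so this sketch assumes an RMS level relative to full scale (1.0) on floating-point samples. The helper names `rms_db` and `normalize_to_db` are hypothetical, and resampling is left out.

```python
import math


def rms_db(samples):
    """Return the RMS level of float samples in dB relative to full scale (1.0).

    Assumes a non-silent signal; an all-zero input has no finite level.
    """
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def normalize_to_db(samples, target_db=-27.0):
    """Scale samples by a single gain so that their RMS level equals target_db."""
    gain = 10.0 ** ((target_db - rms_db(samples)) / 20.0)
    return [s * gain for s in samples]


# Hypothetical input: one second of a 220 Hz sine sampled at 16 kHz.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
        for n in range(sample_rate)]

normalized = normalize_to_db(tone)
print(round(rms_db(normalized), 3))
```

Applying the same target level to every system's output removes loudness as a confounding factor when listening to the samples side by side.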

Model Comparison

Transcript: Why fades the lotus of the water?

Reference	GT	NanoVoice	VoiceTailor	UnitSpeech	XTTS v2	CosyVoice

Transcript: For, like as not, they must have thought him a prince when they saw his fine cap.

Reference	GT	NanoVoice	VoiceTailor	UnitSpeech	XTTS v2	CosyVoice

Transcript: But the general distinction is not on that account to be overlooked.

Reference	GT	NanoVoice	VoiceTailor	UnitSpeech	XTTS v2	CosyVoice

Ablation Studies

Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.

Reference	Share None	Share $B$ (default)	Share $A$	Share $B$, $A$

Transcript: Nine thousand years have elapsed since she founded yours, and eight thousand since she founded ours as our annals record.

Reference	Share None	Share $B$ (default)	Share $A$	Share $B$, $A$

Transcript: Federal judges and United States attorneys in Utah, who were not Mormons nor lovers of Mormonism, refused to entertain complaints, or prosecute cases under the law because of its manifest injustice and inadequacy.

Reference	Share None	Share $B$ (default)	Share $A$	Share $B$, $A$

Trainable Scale Matrix

NanoVoice
w/o Normalization: Shared matrix $B$ with batched $A'$, multiplied by the scale matrix $m'$ without normalizing by $\Vert W_0+\alpha\cdot BA'\Vert_c$.
w/o Scale Matrix: Shared matrix $B$ with batched $A'$.
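
The three variants above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes that $\Vert\cdot\Vert_c$ denotes the column-wise vector norm, that the scale matrix holds one entry per weight column for each speaker, and that the default variant divides by this norm before scaling. The function name `compose_weights` and all shapes are hypothetical.

```python
import numpy as np


def compose_weights(W0, B, A_batch, m_batch=None, alpha=1.0, normalize=True):
    """Build one adapted weight matrix per speaker.

    W0:      (d_out, d_in)     frozen pretrained weight
    B:       (d_out, r)        low-rank factor shared by all speakers
    A_batch: (N, r, d_in)      per-speaker low-rank factors
    m_batch: (N, 1, d_in)      per-speaker scale (assumed shape), or None
    """
    # W0 + alpha * B A' for every speaker at once -> (N, d_out, d_in)
    V = W0[None] + alpha * np.einsum("or,nri->noi", B, A_batch)
    if m_batch is None:
        return V  # "w/o Scale Matrix"
    if normalize:
        # Column-wise norm (assumed meaning of the subscript c): (N, 1, d_in)
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return m_batch * V  # normalize=False gives "w/o Normalization"


rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))
B = rng.normal(size=(6, 2))
A_batch = rng.normal(size=(3, 2, 4))                    # three speakers
m_batch = np.abs(rng.normal(size=(3, 1, 4))) + 0.1      # positive scales

W_default = compose_weights(W0, B, A_batch, m_batch)
W_no_norm = compose_weights(W0, B, A_batch, m_batch, normalize=False)
W_no_scale = compose_weights(W0, B, A_batch)
```

Under these assumptions, the default variant lets each speaker's scale set the magnitude of every weight column directly, while the shared $B$ and the per-speaker $A'$ only determine its direction.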

Transcript: a feeling of freedom and I was awake. Where?

Reference	NanoVoice (default)	w/o Normalization	w/o Scale Matrix

Transcript: What is your country Olaf? Have you always been a Thrall? The Thrall’s eyes flashed.

Reference	NanoVoice (default)	w/o Normalization	w/o Scale Matrix

Transcript: What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.

Reference	NanoVoice (default)	w/o Normalization	w/o Scale Matrix

Analysis

Number of Speakers

Transcript: John Wesley Combash, Jacob Taylor, and Thomas Edward Skinner.

Reference	Batch Size = 1	Batch Size = 5	Batch Size = 20	Batch Size = 40

Transcript: Not all the Galatians had become perverted.

Reference	Batch Size = 1	Batch Size = 5	Batch Size = 20	Batch Size = 40

Role of Shared Matrix $B$

Transcript: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.

Reference	Mixed-Gender	Same-Gender

Transcript: There is the slang of the affected lady as well as of the PRECIEUSES.

Reference	Mixed-Gender	Same-Gender