Voice Conversion

Voice conversion is a technology that modifies a source speaker's speech so that it sounds like that of a target speaker, without changing the linguistic information.

Source: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

Greatest papers with code

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

6 Oct 2020 • espnet/espnet

This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.

SPEECH RECOGNITION VOICE CONVERSION

Mel-spectrogram augmentation for sequence to sequence voice conversion

6 Jan 2020 • makcedward/nlpaug

In addition, we proposed new policies (i.e., frequency warping, loudness control, and time length control) for more data variation.

VOICE CONVERSION
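
The loudness and time length policies named in the abstract can be illustrated with a minimal sketch. It assumes a log-scaled mel-spectrogram stored as a list of frames; the function names `control_loudness` and `control_time_length` are hypothetical, and the paper's actual policies (including frequency warping, which is not shown here) may differ.

```python
def control_loudness(log_mel, gain):
    """Shift every log-mel bin by a constant (assumes log-scaled input)."""
    return [[value + gain for value in frame] for frame in log_mel]


def control_time_length(log_mel, rate):
    """Resample the frame axis by linear interpolation.

    rate > 1 shortens the utterance, rate < 1 lengthens it.
    """
    n_in = len(log_mel)
    n_out = max(1, round(n_in / rate))
    out = []
    for i in range(n_out):
        pos = min(i * rate, n_in - 1)   # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(log_mel[lo], log_mel[hi])])
    return out


# Four frames with two mel bins each (illustrative values only).
mel = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
louder = control_loudness(mel, gain=0.5)
slower = control_time_length(mel, rate=0.5)
```

A rate below 1 stretches the utterance and a rate above 1 compresses it; here `slower` has eight frames for the four-frame input.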

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

23 Sep 2017 • r9y9/gantts

In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.

SPEECH SYNTHESIS VOICE CONVERSION
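
The weighted-sum objective described in the abstract can be sketched as follows. This is a simplified illustration, not the paper's implementation: a plain mean squared error stands in for the minimum generation loss, the discriminator is reduced to a single probability, and the names and the `adv_weight` parameter are assumptions.

```python
import math


def generation_loss(generated, natural):
    """Mean squared error between generated and natural speech parameters."""
    return sum((g - n) ** 2 for g, n in zip(generated, natural)) / len(natural)


def adversarial_loss(disc_prob_natural):
    """Loss for deceiving the discriminator.

    disc_prob_natural is the discriminator's probability that the generated
    parameters are natural; the loss shrinks as that probability nears 1.
    """
    return -math.log(disc_prob_natural)


def acoustic_model_loss(generated, natural, disc_prob_natural, adv_weight=1.0):
    """Weighted sum of the conventional loss and the adversarial loss."""
    return (generation_loss(generated, natural)
            + adv_weight * adversarial_loss(disc_prob_natural))


loss = acoustic_model_loss([0.5, 1.5], [1.0, 1.0],
                           disc_prob_natural=0.5, adv_weight=0.1)
```

When the discriminator is fully deceived (`disc_prob_natural` equal to 1), the adversarial term vanishes and only the conventional loss remains.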

Unsupervised Speech Decomposition via Triple Information Bottleneck

ICML 2020 • auspicious3000/autovc

Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.

STYLE TRANSFER VOICE CONVERSION

AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

14 May 2019 • auspicious3000/autovc

On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.

STYLE TRANSFER VOICE CONVERSION

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion

NeurIPS 2019 • liusongxiang/StarGAN-Voice-Conversion

End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.

AUDIO GENERATION VOICE CONVERSION

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

6 Jun 2018 • liusongxiang/StarGAN-Voice-Conversion

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.

VOICE CONVERSION

Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning

5 Jun 2020 • andi611/Self-Supervised-Speech-Pretraining-and-Representation-Learning

To explore this issue, we proposed to employ Mockingjay, a self-supervised learning based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.

SELF-SUPERVISED LEARNING SPEAKER VERIFICATION VOICE CONVERSION

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

17 Apr 2019 • aliutkus/speechmetrics

In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.

TEST RESULTS VOICE CONVERSION

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

10 Apr 2019 • jjery2243542/adaptive_voice_conversion

Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers.

VOICE CONVERSION

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

9 Apr 2018 • jjery2243542/voice_conversion

The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.

VOICE CONVERSION
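
The decoder conditioning described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "encoder" simply subtracts a per-speaker mean, the decoder is a single linear layer over the concatenated input, and the weights are hand-picked rather than learned adversarially as in the paper.

```python
def encode(frames, speaker_mean):
    """Toy speaker-independent latent: remove the speaker's mean vector."""
    return [[x - m for x, m in zip(frame, speaker_mean)] for frame in frames]


def decode(latent, speaker_embedding, weights):
    """Linear decoder over the concatenation [latent frame ; speaker embedding]."""
    out = []
    for frame in latent:
        inp = frame + speaker_embedding  # list concatenation
        out.append([sum(w * x for w, x in zip(row, inp)) for row in weights])
    return out


# Source speaker A (hypothetical mean [1, 1]) and target speaker B.
frames_a = [[1.5, 0.5], [2.0, 1.0]]
latent = encode(frames_a, speaker_mean=[1.0, 1.0])
embedding_b = [3.0, -1.0]
# Hand-picked weights: copy the latent and add the speaker embedding.
weights = [[1, 0, 1, 0],
           [0, 1, 0, 1]]
converted = decode(latent, embedding_b, weights)
```

Swapping `embedding_b` for another speaker's embedding changes the output voice while the latent, which carries the content of the source utterance, stays the same.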

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

4 Apr 2017 • JeremyCCHsu/vae-npvc

Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios.

VOICE CONVERSION

Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder

13 Oct 2016 • JeremyCCHsu/vae-npvc

We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.

VOICE CONVERSION

Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data

7 May 2020 • mindslab-ai/cotatron

We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation.

VOICE CONVERSION

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

28 May 2019 • andi611/ZeroSpeech-TTS-without-T

We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language.

ADVERSARIAL TRAINING VOICE CONVERSION

MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms

8 Oct 2019 • marcoppasini/MelGAN-VC

We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice.

MUSIC STYLE TRANSFER VOICE CONVERSION

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

29 Jul 2019 • SamuelBroughton/StarGAN-Voice-Conversion-2

To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.

VOICE CONVERSION

Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

22 Jan 2020 • unilight/cdvae-vc

In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.

ADVERSARIAL TRAINING VOICE CONVERSION

Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders

29 Aug 2018 • unilight/cdvae-vc

An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational auto encoders (VAEs), to model the latent structure of speech in an unsupervised manner.

VOICE CONVERSION

Scalable Factorized Hierarchical Variational Autoencoder Training

9 Apr 2018 • wnhsu/ScalableFHVAE

Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.

HYPERPARAMETER OPTIMIZATION ROBUST SPEECH RECOGNITION SPEAKER VERIFICATION VOICE CONVERSION

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

9 Apr 2019 • jackaduma/CycleGAN-VC2

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data.

VOICE CONVERSION

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

30 Nov 2017 • jackaduma/CycleGAN-VC2

A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions with parallel data and twice the amount of data.

VOICE CONVERSION

Deep Residual Neural Networks for Audio Spoofing Detection

30 Jun 2019 • nesl/asvspoof2019

Additionally, replay attacks, in which the attacker uses a loudspeaker to replay previously recorded genuine human speech, are also possible.

SPEAKER VERIFICATION SPEECH SYNTHESIS VOICE CONVERSION

ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

1 Apr 2019 • jefflai108/ASSERT

We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).

FEATURE ENGINEERING VOICE CONVERSION

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

24 Jul 2019 • patrickltobing/cyclevae-vc

In this work, to overcome this problem, we propose to use a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized.

VOICE CONVERSION
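
The recycling idea in the abstract can be illustrated with a toy sketch: converted features have no ground truth in non-parallel data, but feeding them back yields cyclic reconstructions that can be compared with the original input. The scalar "model" below, its `gain` parameter, and the function names are assumptions for illustration only; the paper uses a variational autoencoder.

```python
def convert(spectra, src_code, tgt_code, gain):
    """Toy conversion: encode relative to the source code, decode with the target code."""
    return [gain * (x - src_code) + tgt_code for x in spectra]


def cyclic_loss(spectra, src_code, tgt_code, gain):
    """Mean absolute error between the input and its cyclic reconstruction."""
    converted = convert(spectra, src_code, tgt_code, gain)   # no ground truth available
    recycled = convert(converted, tgt_code, src_code, gain)  # back to the source speaker
    return sum(abs(r - x) for r, x in zip(recycled, spectra)) / len(spectra)


spectra = [1.0, 3.0]
imperfect = cyclic_loss(spectra, src_code=1.0, tgt_code=5.0, gain=0.5)
perfect = cyclic_loss(spectra, src_code=1.0, tgt_code=5.0, gain=1.0)
```

In this toy setting the loss is zero only when `gain` is 1, so minimizing it constrains the conversion path even though no target-speaker recording of the same utterance is available.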

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

7 Jun 2020 • ericwudayi/SkipVQVC

Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into those of another speaker while preserving the linguistic content.

ADVERSARIAL TRAINING QUANTIZATION VOICE CONVERSION

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

10 Aug 2020 • KunZhou9646/Singing-Voice-Conversion-with-conditional-VAW-GAN

We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.

VOICE CONVERSION

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

15 Apr 2020 • CODEJIN/AutoVC

Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks; it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.

STYLE TRANSFER VOICE CONVERSION

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

30 Oct 2018 • b04901014/ISGAN

This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.

VOICE CONVERSION

ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

13 Aug 2018 • aoixcat/ACVAE-VC

Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier.

VOICE CONVERSION

Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion

30 Sep 2020 • cjerry1243/TransferLearning-CLVC

Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages.

TRANSFER LEARNING VOICE CONVERSION

CinC-GAN for Effective F0 prediction for Whisper-to-Normal Speech Conversion

18 Aug 2020 • Maitreyapatel/speech-conversion-between-different-modalities

The CycleGAN-based method uses two different models, one for Mel Cepstral Coefficients (MCC) mapping, and another for F0 prediction, where F0 is highly dependent on the pre-trained model of MCC mapping.

VOICE CONVERSION

Robust Training of Vector Quantized Bottleneck Models

18 May 2020 • distsup/DistSup

We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs.

LATENT VARIABLE MODELS UNSUPERVISED REPRESENTATION LEARNING VOICE CONVERSION

Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants

9 Aug 2019 • RanyaJumah/PP_Speech_Analysis

The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits.

EMOTION RECOGNITION SPEECH RECOGNITION VOICE CONVERSION

Voice Conversion using Convolutional Neural Networks

27 Oct 2016 • ShariqM/smcnn

The human auditory system is able to distinguish the vocal source of thousands of speakers, yet not much is known about what features the auditory system uses to do this.

VOICE CONVERSION

Vocoder-free End-to-End Voice Conversion with Transformer Network

5 Feb 2020 • kaen2891/kaen2891.github.io

Additional pre- and post-processing, such as MFB and a vocoder, is not essential for converting real human speech to another voice.

SPEECH RECOGNITION VOICE CONVERSION

STC Antispoofing Systems for the ASVspoof2019 Challenge

11 Apr 2019 • ozora-ogino/LCNN

We enhanced the Light CNN architecture, which the authors previously considered for replay attack detection and which achieved high spoofing detection quality in the ASVspoof2017 challenge.

SPEECH SYNTHESIS VOICE CONVERSION