Emotion transfer in voice using neural networks…

Objective

The objective of this project is to develop a machine learning model that transforms the emotion expressed in a speaker's voice. Emotion transfer remains an open problem, although significant progress has been made since the emergence of generative adversarial networks (GANs).

In this work, we propose a model that builds on the work of [1], which is based on a type of generative adversarial network called CycleGAN, and we incorporate architectural updates that improve the quality of the synthesized voices. We address three issues:

Traditionally, models have been trained on parallel data, that is, utterances that share the same linguistic content but are spoken with different emotions. This requirement made it impossible to use real data for training and restricted the usefulness of the models. CycleGAN networks do not require parallel data, so they can leverage real-world data.
Emotion in speech is mostly related to prosodic aspects of the voice, such as pitch and rhythm. To achieve effective emotion transfer, these features must be transformed. In this project, we decompose the fundamental frequency of the voice with the continuous wavelet transform, which has been shown to improve the conversion of the fundamental frequency and, consequently, of prosody.
The quality of the transformed voices is inferior to that of the original voices. In this project, we incorporate an updated CycleGAN architecture proposed in [2] for identity transfer problems, and show that the quality of the synthesized voices improves compared to the baseline model.
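
As a rough illustration of why CycleGAN training does not need parallel data, the sketch below computes a cycle-consistency loss on two unpaired batches of features. It is a minimal sketch and not the model of this project: the generators g_ab and g_ba are hypothetical stand-ins (simple linear maps instead of neural networks), the 24-dimensional features are an arbitrary choice, and the adversarial and identity losses used in practice are omitted.

```python
import numpy as np


def l1(a, b):
    """Mean absolute error between two feature arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))


def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """L1 cycle loss: mapping A -> B -> A (and B -> A -> B) should return the input.

    x_a and x_b are unpaired batches: they need not share linguistic content
    or even have the same number of frames.
    """
    rec_a = g_ba(g_ab(x_a))
    rec_b = g_ab(g_ba(x_b))
    return l1(rec_a, x_a) + l1(rec_b, x_b)


rng = np.random.default_rng(0)

# Hypothetical toy generators: an orthogonal matrix and its transpose,
# so that g_ba exactly inverts g_ab. Real generators are neural networks.
q, _ = np.linalg.qr(rng.normal(size=(24, 24)))


def g_ab(x):
    return x @ q


def g_ba(x):
    return x @ q.T


# Unpaired batches of frame-level features (frames x feature dimension).
x_neutral = rng.normal(size=(100, 24))
x_angry = rng.normal(size=(80, 24))

# Consistent generator pair: the cycle loss is numerically zero.
loss_consistent = cycle_consistency_loss(x_neutral, x_angry, g_ab, g_ba)

# Inconsistent pair (g_ba replaced by the identity): the cycle loss is large.
loss_inconsistent = cycle_consistency_loss(x_neutral, x_angry, g_ab, lambda x: x)
```

The loss only compares each batch with its own reconstruction, so no alignment between x_neutral and x_angry is ever required.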

The final model enhances spectrum transformation, prosody, and overall quality of the synthesized voices.
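
To make the prosody step more concrete, the following sketch decomposes a fundamental frequency (F0) contour into several time scales with a continuous wavelet transform. It is an illustration under stated assumptions rather than the implementation used in this project: the Mexican-hat mother wavelet, the ten dyadic scales, the linear interpolation over unvoiced frames, and all function names are choices made only for this example.

```python
import numpy as np


def mexican_hat(t):
    """Mexican-hat (Ricker) mother wavelet, with the normalizing constant dropped."""
    return (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def interpolate_unvoiced(f0):
    """Linearly interpolate over unvoiced frames (f0 == 0) to get a continuous contour.

    Assumes at least one voiced frame is present.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], f0[voiced])


def cwt_decompose(f0, num_scales=10, base_scale=1.0):
    """Return a (num_scales, T) array of wavelet coefficients of normalized log-F0.

    Scales are dyadic (base_scale * 2**i), measured in frames.
    Assumes the contour is not constant, so its standard deviation is non-zero.
    """
    log_f0 = np.log(interpolate_unvoiced(f0))
    norm = (log_f0 - log_f0.mean()) / log_f0.std()
    n = len(norm)
    coeffs = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** i
        half = int(np.ceil(5 * scale))  # kernel support of +/- 5 scale units
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        # Full convolution, then keep the n samples aligned with the input.
        full = np.convolve(norm, kernel, mode="full")
        coeffs[i] = full[half:half + n]
    return coeffs


# Synthetic F0 contour in Hz with an unvoiced gap (illustrative data only).
frames = np.arange(200)
f0 = 120.0 + 20.0 * np.sin(2 * np.pi * frames / 50.0)
f0[60:80] = 0.0

coeffs = cwt_decompose(f0)
```

Each row of coeffs describes the contour at one time scale, from fast frame-level variation in the first rows to slow phrase-level trends in the last ones, which is what lets a conversion model treat short-term and long-term prosody separately.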

[1] K. Zhou, B. Sisman, and H. Li, Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data, arXiv:2002.00198 [cs, eess], Oct. 2020. [Online]. Available: http://arxiv.org/abs/2002.00198 (Accessed: 04-04-2023).

[2] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion, arXiv:1904.04631 [cs, eess, stat], Apr. 2019. [Online]. Available: http://arxiv.org/abs/1904.04631 (Accessed: 04-04-2023).

BACHELOR’S THESIS BY:

PABLO DÍAZ LARRAÍN

Degree

Degree in Computer Science and Engineering

Work Experience

Researcher at Cátedra UC3M-MásMóvil (September 2022 – May 2023)

Technical skills

Programming languages: Python, C/C++, JavaScript.

Libraries: TensorFlow, NumPy, Pandas.

Platforms: Google Cloud Platform and Vertex AI.