Objective

This project arises from the need, within the field of digital security, to determine whether an audio sample has been synthetically generated (or voice-converted) or corresponds to a human voice. Content providers and digital platforms such as social networks are increasingly facing fake audio generated through text-to-speech (TTS) systems or voice conversion (VC). This poses not only a risk to the authenticity and trustworthiness of voice-based verification systems, but also a threat to humans due to the risk of fraud and impersonation.

The main challenge lies in the fact that most current detection systems are binary—real or fake—and do not distinguish the type of attack. Additionally, their performance degrades when they are faced with audio that has undergone compression or encoding processes, which is problematic given that most audio on the Internet exists in this “degraded” state.

The objective of the project is therefore to develop a pipeline capable of classifying audio into three classes: TTS, VC, or human. In addition, we aim for it to perform well both on offline recordings and in real time, and to be robust under various complex channel conditions. As an alternative to end-to-end approaches, we rely on embeddings of clear acoustic descriptors that also enable explanatory analysis and the study of anomalous features in parallel with detection.

The proposed solution combines the extraction of acoustic descriptors (RMS, MFCCs, ZCR, fundamental frequency, etc.), feature selection through statistical tests, and the construction of an ensemble of classifiers with a decision “jury.” For real-time detection, a dual architecture is implemented using a cumulative window and a sliding window. This is complemented by an anomaly analysis module based on KL divergence. The architecture is evaluated on a representative audio set (a stratified and balanced sample generated from the ASVspoof Challenge dataset), achieving competitive results and informed real-time detections. It is shown that an embedding-based approach can generalize well even under compression and encoding conditions, as well as be deployed in real-time scenarios.

BACHELOR’S THESIS BY

ANDREA LÓPEZ SALAZAR

Academic experience

  • Dual Bachelor in Computer Science and Engineering and Business Administration, Universidad Carlos III de Madrid (september 2021 – july 2026)

     

    Work experience

    • Machine Learning Researcher – Universidad Carlos III de Madrid in collaboration with Grupo MasOrange (september 2025 — june 2026)
    • Machine Learning Researcher – Universidad Carlos III de Madrid in collaboration with DEIMOS-SPACE (january 2025 – july 2025)


    Skills

    • Programming languages: Python, C/C++, SQL, HTML/CSS, JavaScript.
    • Development libraries: Pandas, OpenCV, Numpy, PyTorch, Keras, Sci-kit Learn.
    • Cloud platforms: Google Cloud.
    • Frameworks: GitHub, GitLab.