Development of a system for telephone call processing: transcription, diarization and classification by LLM

Objective

The massive volume of daily calls handled by telecommunications companies presents significant challenges for data management and analysis. Transcribing these calls manually is a laborious and costly process that introduces delays and errors. In addition, without analysis of customer interactions, trends and recurring problems that may affect service quality go unidentified.

Therefore, an automated call processing system has been developed. It consists of a transcription model and a diarization model that together convert audio to text, plus an LLM that classifies each call according to predefined classes.

The transcription model used is OpenAI’s Whisper. After several tests, its Large-v2 model was found to perform best, with a word error rate (WER) of 10% on average. Although the results are good, the model is sensitive to audio quality, background noise, accents and technical terms. It takes the audio as input and outputs the words spoken in the audio with their timestamps.
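To make the WER figure concrete, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions; the thesis does not describe its evaluation code or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Simplistic normalization (lowercase + whitespace split); an assumption.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing in hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the thesis data): one wrong word out of ten.
reference = "i would like to cancel my fiber contract today please"
hypothesis = "i would like to cancel my fibre contract today please"
wer = word_error_rate(reference, hypothesis)
```

Under this definition, a WER of 10% corresponds to roughly one word in ten being substituted, dropped or inserted relative to the reference.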

The diarization model is Nvidia’s NeMo, which offers an innovative mechanism for extracting the characteristics of each speaker’s voice through its Multi-scale Diarization Decoder. In addition, Nvidia offers specific parameters for the telephone call domain. The results obtained with this model are good, but its accuracy decreases when there are interruptions and overlapping speech. It takes audio as input and returns speech segments as output, indicating the start, the duration and the speaker of each one.
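As an illustration of this output, the sketch below reads speech segments from RTTM text, a plain-text format widely used for diarization results in which each SPEAKER line carries the start, the duration and the speaker label. That the system stores its segments as RTTM is an assumption made here, and the `Segment` class, the `parse_rttm` helper and the sample lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float      # seconds from the beginning of the call
    duration: float   # seconds
    speaker: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def parse_rttm(text: str) -> list[Segment]:
    """Read SPEAKER lines from RTTM text and return segments sorted by start."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # RTTM columns: type, file id, channel, onset, duration,
        # orthography, speaker type, speaker name, confidence, lookahead.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        segments.append(Segment(float(fields[3]), float(fields[4]), fields[7]))
    return sorted(segments, key=lambda s: s.start)


# Made-up sample for illustration only.
sample = """\
SPEAKER call_001 1 0.50 3.20 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER call_001 1 3.90 2.10 <NA> <NA> speaker_1 <NA> <NA>
"""
segments = parse_rttm(sample)
```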

Once the timestamps of the words and speech segments are available, the spoken words are assigned to each speaker, thus completing the audio-to-text transition.
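The assignment step can be sketched as follows. The rule used here (give each word to the speaker whose segment overlaps it the most, and fall back to the nearest segment when nothing overlaps) is an assumption for illustration, not necessarily the rule used in the thesis; the function names and the sample data are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """words: (text, start, end) tuples; segments: (start, end, speaker) tuples.

    Returns a list of (speaker, text) pairs in word order.
    """
    if not segments:
        raise ValueError("at least one speech segment is required")
    labelled = []
    for text, w_start, w_end in words:
        best = max(segments, key=lambda s: overlap(w_start, w_end, s[0], s[1]))
        if overlap(w_start, w_end, best[0], best[1]) == 0.0:
            # The word falls in a gap: pick the segment closest to its midpoint.
            mid = (w_start + w_end) / 2
            best = min(segments, key=lambda s: max(s[0] - mid, mid - s[1], 0.0))
        labelled.append((best[2], text))
    return labelled


def to_turns(labelled):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, text in labelled:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)
        else:
            turns.append((speaker, [text]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]


# Made-up timestamps for illustration only.
segments = [(0.0, 2.0, "speaker_0"), (2.0, 4.0, "speaker_1")]
words = [("Hello", 0.2, 0.6), ("there", 0.7, 1.1),
         ("Hi", 2.1, 2.4), ("thanks", 4.2, 4.6)]
turns = to_turns(assign_speakers(words, segments))
```

Because the two models produce their timestamps independently, words near a speaker change can straddle two segments; the maximum-overlap rule is one simple way to resolve those cases.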

Finally, the text is classified into pre-established categories within the general subject of customer retention. Calls are classified according to the reason and sub-reason why the customer called to unsubscribe. Different language models, such as GPT, Gemini, LLaMA and Gemma, were compared, with gpt-3.5-turbo giving the best results.

It is essential to develop a good prompt, so that the model understands the task to be performed, and to establish clear, concrete and distinct categories that do not give rise to confusion. In this way, good results are achieved, supporting the use of this type of tool for text classification.
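A minimal sketch of what such a prompt and its validation might look like is given below. The reason and sub-reason names, the prompt wording and the JSON reply format are all hypothetical, since the thesis summary does not list its categories or its prompt. The call to the language model itself is deliberately left out.

```python
import json

# Hypothetical reason -> sub-reason taxonomy, for illustration only.
CATEGORIES = {
    "price": ["bill too high", "better offer elsewhere"],
    "service quality": ["slow connection", "frequent outages"],
    "other": ["unspecified"],
}


def build_prompt(transcript: str, categories=CATEGORIES) -> str:
    """State the task, enumerate the allowed labels, then append the call."""
    lines = [
        "You classify customer-retention calls for a telecom operator.",
        "Choose exactly one reason and one of its sub-reasons from this list:",
    ]
    for reason, sub_reasons in categories.items():
        lines.append(f"- {reason}: {', '.join(sub_reasons)}")
    lines += [
        'Answer with JSON only, in the form {"reason": "...", "sub_reason": "..."}.',
        "",
        "Call transcript:",
        transcript,
    ]
    return "\n".join(lines)


def parse_reply(reply: str, categories=CATEGORIES):
    """Return (reason, sub_reason) if the reply uses allowed labels, else None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    reason, sub_reason = data.get("reason"), data.get("sub_reason")
    if isinstance(reason, str) and reason in categories \
            and sub_reason in categories[reason]:
        return reason, sub_reason
    return None


prompt = build_prompt("speaker_1: My bill keeps going up, I want to cancel.")
```

Enumerating the allowed labels in the prompt and rejecting any reply outside that list is one way to keep the categories closed, so that an unexpected model answer is caught rather than stored as a new class.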

In conclusion, this project provides value by creating a system that converts calls from audio to text and then classifies them into defined categories. It also lays the groundwork for future work on improving accuracy and performance or extending functionality.

BACHELOR’S THESIS BY:

CARLOS CAMARERO FUENTE

Academic Experience

Double Degree in Computer Engineering and Business Administration, Universidad Carlos III de Madrid (September 2018 – June 2024)

Work Experience

AI Tech Specialist – Grupo MásMóvil (September 2024 – Present)
Machine Learning Researcher – Universidad Carlos III de Madrid in collaboration with Grupo MásMóvil (September 2023 – May 2024)

Awards and Certifications

Community of Madrid Excellence Scholarship 2022/23
Best Creative Idea in Digital Transformation in the field of Occupational Health and Safety (2022)
Community of Madrid Excellence Scholarship 2021/22
Community of Madrid Excellence Scholarship 2020/21
Community of Madrid Excellence Scholarship 2019/20
Community of Madrid Excellence Scholarship 2018/19
Community of Madrid Excellence Scholarship 2018

Technical skills

Programming languages: Python, C/C++, SQL, HTML/CSS/JS.
Development libraries: Pandas, NumPy, TensorFlow, scikit-learn.
Cloud Platforms: Google Cloud.
Tools: Git, Docker.
