Objective

A company’s ability to understand its customers largely determines its ability to offer relevant products and services. One way to build this understanding is clustering: segmenting the customer base in order to discover patterns of behaviour, preferences or needs shared by specific groups. This segmentation makes it possible to optimise marketing campaigns, improve the customer experience and make better-informed strategic decisions.

This work falls within the field of unsupervised machine learning, and in particular clustering, a family of techniques that discover underlying structure in data without requiring prior labelling. Among the most common algorithms for this type of task are K-Means, DBSCAN and Hierarchical Clustering, each with characteristics that make it better suited to particular contexts.

For this research we used a dataset composed of all the information that MasOrange stores about its customers, to which we applied a variety of transformations to maximise the quality of the resulting clusters. These transformations included normalisation, dimensionality reduction with techniques such as PCA, and the combination of several sources of information, including data from outside MasOrange that was joined to the company’s own tables. The data were then separated by brand in order to observe how customer types differ across the company’s sub-brands.
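The normalisation and PCA steps described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is available; the six attributes, their scales and the 95% variance threshold are hypothetical choices and are not taken from the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer table (the real MasOrange data is not shown here).
rng = np.random.default_rng(0)
n_customers = 200
latent = rng.normal(size=(n_customers, 2))

# Six hypothetical numeric attributes on very different scales,
# all driven by two latent factors plus a little noise.
mixing = rng.normal(size=(2, 6))
scales = np.array([1.0, 10.0, 100.0, 1000.0, 0.1, 5.0])
features = (latent @ mixing + 0.05 * rng.normal(size=(n_customers, 6))) * scales

# Normalisation: zero mean and unit variance per column,
# so that no attribute dominates purely because of its scale.
scaled = StandardScaler().fit_transform(features)

# Dimensionality reduction: keep the smallest number of principal
# components that together explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(scaled)

print("reduced shape:", reduced.shape)
print("explained variance:", pca.explained_variance_ratio_.sum())
```

Scaling before PCA matters here because PCA follows variance: without it, the attribute with the largest numeric range would dominate the first component.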

Once the data had been understood and transformed to suit the problem at hand, various clustering algorithms were tested: from the ‘classics’ already mentioned, such as K-Means and DBSCAN, to less conventional approaches such as self-organising maps (a type of neural network) implemented with MiniSom, and Affinity Propagation. A combination of qualitative and quantitative techniques was used to compare the algorithms. On the qualitative side, visual representations such as box plots and two-dimensional PCA projections were used to inspect the separation and internal coherence of the generated clusters. On the quantitative side, clustering-quality metrics were used, such as the Silhouette Score, which quantifies the cohesion within clusters and the separation between them. The conclusion was that DBSCAN was the best algorithm for this problem.
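As an illustration of the quantitative comparison described above, the sketch below fits K-Means and DBSCAN on synthetic data and scores each with the Silhouette Score. It assumes scikit-learn; the data, the parameter values (k, eps, min_samples) and the helper silhouette_ignoring_noise are hypothetical, and excluding DBSCAN noise points from the score is one possible convention, not necessarily the one used in the thesis.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (a stand-in, not the thesis data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=0,
)
X = StandardScaler().fit_transform(X)


def silhouette_ignoring_noise(X, labels):
    """Silhouette Score over clustered points only (DBSCAN marks noise as -1)."""
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return float("nan")  # undefined for fewer than two clusters
    return silhouette_score(X[mask], labels[mask])


# Hypothetical parameter choices, for illustration only.
candidates = {
    "K-Means (k=3)": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN (eps=0.3)": DBSCAN(eps=0.3, min_samples=5),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    scores[name] = silhouette_ignoring_noise(X, labels)
    print(f"{name}: silhouette = {scores[name]:.3f}")
```

The Silhouette Score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. On real data the ranking of algorithms depends heavily on parameters such as eps, which is why the score is usually read alongside visual checks.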

The results of this brand-by-brand customer segmentation were varied, ranging from brands whose customers follow very specific patterns with little separation between them to brands where particular attributes mark out clear groupings. Most notable is the importance of internet speed in determining the segment to which each customer is assigned.

Based on the results of this segmentation, a classification model was built to choose which brand to recommend to a new customer. CatBoost, a ‘boosting’ model (an ensemble of decision trees in which each new tree is trained to correct the errors of the previous ones), was used for this purpose, and for some brands it predicts the brand to which a customer belongs with more than 80% accuracy.
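The boosting idea described above can be sketched with a small multi-class example. Note that this sketch swaps in scikit-learn's GradientBoostingClassifier as a stand-in for CatBoost, so that it runs with scikit-learn alone; the three classes play the role of hypothetical brands, the data is synthetic, and the resulting accuracy says nothing about the figures reported in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three hypothetical "brands" as classes,
# eight numeric customer attributes.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Boosting: trees are added one after another, each fitted to the
# errors left by the ensemble built so far.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_train, y_train)

# Evaluate on customers the model has not seen during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Measuring accuracy on a held-out split, rather than on the training data, is what makes a figure such as "more than 80%" meaningful for new customers.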

BACHELOR’S THESIS BY:

ÁLVARO OBIES GARCÍA

Academic Experience

  • Computer Science and Engineering, Universidad Carlos III de Madrid (September 2021 – September 2025)

Work Experience

  • Machine Learning Researcher – Universidad Carlos III de Madrid in collaboration with Grupo MasOrange (September 2024 – June 2025)

Technical Skills

  • Programming languages: Python, JavaScript, C, C#, C++, Java, Go, SQL.

  • Web development: Experienced in frontend design and backend development.

  • Databases and Cloud Computing: Experienced with Google Cloud and BigQuery.

  • Version control and DevOps: GitHub, GitLab.

  • Containers: Configuration and environment deployment using Docker.