Objective
The system treats spam detection as a problem that requires not only classifying messages but also understanding their internal structure. Rather than limiting itself to a basic textual representation, the model starts by extracting bigrams, i.e., pairs of consecutive words, to better capture the local context of language. This allows it to detect expressive patterns that recur in spam messages and that would often be lost if only isolated words were used.
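As an illustration of the bigram representation described above, the following minimal sketch counts pairs of consecutive words over a few example messages. The tokenization rule (lowercasing and keeping alphanumeric tokens), the function name extract_bigrams, and the sample messages are assumptions made for the example, not details taken from the thesis.

```python
from collections import Counter
import re


def extract_bigrams(text):
    """Return the list of consecutive word pairs in a message."""
    # Assumed tokenization: lowercase, keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Pair each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))


# Hypothetical messages used only to illustrate the idea.
messages = [
    "Claim your free prize now",
    "Your free prize is waiting, claim now",
    "Are we still meeting for lunch tomorrow",
]

# Count how often each bigram appears across the whole corpus.
bigram_counts = Counter()
for message in messages:
    bigram_counts.update(extract_bigrams(message))

print(bigram_counts.most_common(3))
```

Here the pair ("free", "prize") appears in both spam-like messages, whereas the isolated words "free" and "prize" would carry less context on their own.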
From this initial representation, the system seeks to uncover the thematic organization of the corpus. To do so, documents are projected into a more compact space using dimensionality reduction with UMAP, which preserves similarity relationships between messages while using far fewer dimensions. This step is important because textual data typically lives in very sparse, high-dimensional spaces where useful structure is difficult to identify directly.
HDBSCAN, a density-based clustering algorithm, is then applied to this reduced space. It detects groups of similar documents while also flagging as noise the messages that do not clearly belong to any group. Unlike more rigid methods, HDBSCAN can adapt to clusters of varying density, which is especially useful in a problem such as spam detection, where not all patterns have the same frequency or consistency.
Once the clusters are obtained, the system constructs cluster prototypes that act as representative summaries of each group. These prototypes make it possible to interpret what semantically characterizes each set of messages and serve as a reference for describing the thematic structure discovered in the corpus. From there, each document is no longer represented solely by its original bigrams, but also by its relationship with the detected topics or clusters, thereby generating a layer of document-level and topic-level embeddings that significantly enrich the representation of each message.
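One simple way to picture prototypes and topic-level features is sketched below. It assumes that a prototype is the mean embedding of the documents in a cluster and that the relationship between a document and a topic is measured with cosine similarity. Both choices, together with the function names and the toy data, are assumptions for illustration; the thesis may build its prototypes differently.

```python
import numpy as np


def cluster_prototypes(embeddings, labels):
    """Mean embedding per cluster, skipping noise points (label -1)."""
    cluster_ids = sorted(set(labels.tolist()) - {-1})
    prototypes = np.vstack(
        [embeddings[labels == c].mean(axis=0) for c in cluster_ids]
    )
    return cluster_ids, prototypes


def prototype_similarities(embeddings, prototypes):
    """Cosine similarity of every document to every prototype."""
    doc_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    proto_unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return doc_unit @ proto_unit.T


# Hypothetical 2-D document embeddings and cluster labels.
embeddings = np.array([
    [1.0, 0.1], [0.9, 0.2],   # documents in cluster 0
    [0.1, 1.0], [0.2, 0.9],   # documents in cluster 1
    [0.7, 0.7],               # document labeled as noise
])
labels = np.array([0, 0, 1, 1, -1])

cluster_ids, prototypes = cluster_prototypes(embeddings, labels)

# One row per document, one column per cluster prototype.
topic_features = prototype_similarities(embeddings, prototypes)
print(topic_features.round(3))
```

Each row of topic_features could then be appended to the document's bigram features, so that a noise document still receives a graded affinity to every topic rather than a single hard label.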
The final prediction stage is carried out using XGBoost, which acts as a supervised classifier capable of combining both the initial textual features and the new signals derived from topic modeling, prototypes, and embeddings. In this way, the final decision does not rely solely on the presence of specific words, but on a much more structured representation in which local context, thematic affinity, and relative membership to global patterns in the corpus are integrated. This gives the model greater ability to robustly discriminate between legitimate messages and spam.
Finally, the system incorporates an explainability layer aimed at understanding why the model makes a specific decision. This component makes it possible to analyze which variables, topics, or signals have been most influential in each prediction, at both the global and the local level. As a result, the model not only achieves strong classification performance but also provides transparency, interpretability, and analytical insight, features that are especially valuable when working with automated systems that must be understandable and justifiable.
BACHELOR’S THESIS BY
MARIO COLLAZO NARANJO
Academic experience
- Computer Science and Engineering Degree, Universidad Carlos III de Madrid (September 2022 – September 2026)
Work experience
- AI researcher – Universidad Carlos III de Madrid, in collaboration with Grupo MasOrange (October 2025 – June 2026)
Skills
- Programming languages: Python, Java, C++, JavaScript, Go, MATLAB
- Libraries: Pandas, NumPy, TensorFlow, Keras, Scikit-Learn
- Development: Spring Boot, Spring MVC, Maven
- Databases and Cloud-Computing: AWS, Amazon S3, Oracle, BigQuery, PostgreSQL, PL/SQL
- Version control and DevOps: GitHub, GitLab
