{"id":2222,"date":"2026-05-25T15:58:17","date_gmt":"2026-05-25T15:58:17","guid":{"rendered":"https:\/\/catedramasmovil.uc3m.es\/2026\/05\/25\/spam-detection-in-emails\/"},"modified":"2026-05-25T16:38:17","modified_gmt":"2026-05-25T16:38:17","slug":"spam-detection-in-emails","status":"publish","type":"post","link":"https:\/\/catedramasmovil.uc3m.es\/en\/2026\/05\/25\/spam-detection-in-emails\/","title":{"rendered":"Spam detection in emails"},"content":{"rendered":"\n[et_pb_section fb_built=&#8221;1&#8243; theme_builder_area=&#8221;post_content&#8221; _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221;][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.17.0&#8243; custom_padding=&#8221;0px||||false|false&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_row _builder_version=&#8221;4.17.0&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;0px||||false|false&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.17.0&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_gallery gallery_ids=&#8221;2168,2170,2172,2174,2176,2178,2180&#8243; fullwidth=&#8221;on&#8221; _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221; sticky_enabled=&#8221;0&#8243;][\/et_pb_gallery][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8221;4.16&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; custom_margin=&#8221;|auto||103px||&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_column type=&#8221;4_4&#8243; 
_builder_version=&#8221;4.16&#8243; custom_padding=&#8221;|||&#8221; global_colors_info=&#8221;{}&#8221; custom_padding__hover=&#8221;|||&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_text _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;||0px|||&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221; sticky_enabled=&#8221;0&#8243;]<h2 style=\"text-align: justify\">Objective<\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">The system developed approaches spam detection as a problem that requires not only classifying messages but also understanding their internal structure. Rather than limiting itself to a basic textual representation, the model starts from the extraction of bigrams, i.e., pairs of consecutive words, in order to better capture the local context of language. This allows it to detect phrasing patterns that are often repeated in spam messages and that would frequently be lost if only isolated words were used. <\/span><\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">From this initial representation, the system seeks to uncover the thematic organization of the corpus. To do so, documents are projected into a more compact space using dimensionality reduction with UMAP, which aims to preserve similarity relationships between messages in a far lower-dimensional space. This step is important because textual data typically lives in very sparse, high-dimensional spaces where it is difficult to identify useful structures directly. <\/span><\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">HDBSCAN, a density-based clustering algorithm, is then applied in this reduced space; it detects groups of similar documents while also flagging as noise the messages that do not clearly belong to any group. 
Unlike more rigid methods, HDBSCAN can adapt to clusters of varying density, which is especially useful in a problem such as spam detection, where not all patterns have the same frequency or consistency. <\/span><\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">Once the clusters are obtained, the system constructs cluster prototypes that act as representative summaries of each group. These prototypes make it possible to interpret what semantically characterizes each set of messages and serve as a reference for describing the thematic structure discovered in the corpus. From there, each document is represented not only by its original bigrams but also by its relationship with the detected topics or clusters, thereby generating a layer of document-level and topic-level embeddings that significantly enrich the representation of each message. <\/span><\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">The final prediction stage is carried out using XGBoost, which acts as a supervised classifier capable of combining both the initial textual features and the new signals derived from topic modeling, prototypes, and embeddings. In this way, the final decision relies not only on the presence of specific words but on a much more structured representation that integrates local context, thematic affinity, and relative membership in global patterns in the corpus. This gives the model a greater ability to discriminate robustly between legitimate messages and spam. <\/span><\/h2>\n<h2 style=\"text-align: justify\"><span style=\"font-size: 14px;color: #666666\">Finally, the system incorporates an explainability layer aimed at understanding why the model makes a specific decision. This component makes it possible to analyze which variables, topics, or signals have been most influential in each prediction, at both the global and the local level. 
As a result, the model not only achieves strong classification performance but also provides transparency, interpretability, and analytical insight\u2014features that are especially valuable when working with automated systems that must be understandable and justifiable.  <\/span><\/h2>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.18.0&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; custom_padding=&#8221;0px||||false|false&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_row column_structure=&#8221;1_2,1_2&#8243; _builder_version=&#8221;4.25.1&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_column type=&#8221;1_2&#8243; _builder_version=&#8221;4.18.0&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_image src=&#8221;https:\/\/storage.googleapis.com\/wp-uploads.bucket.wp.uc3m.es\/wp-content\/uploads\/sites\/70\/2026\/05\/25155158\/Foto_Mario.jpg&#8221; title_text=&#8221;Foto_Mario&#8221; _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221; sticky_enabled=&#8221;0&#8243;][\/et_pb_image][\/et_pb_column][et_pb_column type=&#8221;1_2&#8243; _builder_version=&#8221;4.18.0&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_text _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;||0px|||&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221; sticky_enabled=&#8221;0&#8243;]<p><span style=\"color: 
#003366\"><strong>BACHELOR&#8217;S THESIS BY<\/strong><\/span><\/p>\n<p><b>MARIO COLLAZO NARANJO<\/b><\/p>[\/et_pb_text][et_pb_text _builder_version=&#8221;4.25.1&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;]<p><strong><\/strong><\/p>\n<p><strong><\/strong><\/p>\n<p><strong><\/strong><\/p>\n<p><strong>Academic experience<\/strong><\/p>\n<ul>\n<li>Computer Science and Engineering Degree, Universidad Carlos III de Madrid (September 2022 \u2013 September 2026)<\/li>\n<\/ul>\n<ul><\/ul>\n<p>&nbsp;<\/p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.20.2&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_text _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;||9px|||&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;]<p><strong>Work experience<\/strong><\/p>\n<ul>\n<li>AI researcher \u2013 Universidad Carlos III de Madrid along with Grupo MasOrange (October 2025 \u2013 June 2026)<\/li>\n<\/ul>\n<p><strong><br>Skills<\/strong><\/p>\n<ul>\n<li>\n<p>Programming languages: Python, Java, C++, JavaScript, Go, MATLAB<\/p>\n<\/li>\n<li>\n<p>Libraries: Pandas, NumPy, TensorFlow, Keras, Scikit-Learn<\/p>\n<\/li>\n<li>Development: Spring Boot, Spring MVC, Maven<\/li>\n<li>\n<p>Databases and cloud computing: AWS, Amazon S3, Oracle, BigQuery, PostgreSQL, PL\/SQL<\/p>\n<\/li>\n<li>\n<p>Version control and DevOps: GitHub, GitLab<\/p>\n<\/li>\n<\/ul>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8221;4.25.1&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; 
theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.25.1&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221;][et_pb_text _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; theme_builder_area=&#8221;et_body_layout&#8221; sticky_enabled=&#8221;0&#8243;]<blockquote>\n<p><span style=\"color: #000000\"><a href=\"https:\/\/es.linkedin.com\/in\/mario-collazo-naranjo-b056a0302\" target=\"_blank\" rel=\"noopener\" style=\"color: #000000\"><span style=\"text-decoration: underline\">LinkedIn<\/span><\/a><\/span><\/p>\n<\/blockquote>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":172,"featured_media":2169,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[80],"tags":[],"class_list":["post-2222","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-projects-2025-2026"],"_links":{"self":[{"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/posts\/2222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/users\/172"}],"replies":[{"embeddable":true,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/comments?post=2222"}],"version-history":[{"count":2,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/posts\/2222\/revisions"}],"predecessor-version":[{"id":2226,"href":"https:\/\/catedramasmovil.uc3m.
es\/en\/wp-json\/wp\/v2\/posts\/2222\/revisions\/2226"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/media\/2169"}],"wp:attachment":[{"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/media?parent=2222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/categories?post=2222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/catedramasmovil.uc3m.es\/en\/wp-json\/wp\/v2\/tags?post=2222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}