Objective


This work presents a hybrid system for binary file classification, focused primarily on malware detection. The pipeline starts from a binary file, such as a PE executable or an Android APK, which is disassembled with IDA Pro to obtain its assembly code. From this assembly code, a token sequence is generated that represents instructions, operands, calls, jumps, and other relevant program elements.
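The tokenization step can be illustrated with a minimal sketch. It assumes the disassembly is available as plain text lines in Intel syntax; the register set, the coarse token classes (REG, IMM, MEM, SYM), and the function names are hypothetical choices made for illustration, not the vocabulary used in the thesis.

```python
import re

# Hypothetical normalization rules; the thesis does not specify its token vocabulary.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}
IMMEDIATE = re.compile(r"^(0x[0-9a-f]+|[0-9][0-9a-f]*h?)$", re.IGNORECASE)


def tokenize_instruction(line):
    """Turn one disassembly line into a list of coarse tokens."""
    line = line.split(";", 1)[0].strip()  # drop trailing comments
    if not line:
        return []
    parts = line.split(None, 1)  # mnemonic, then the operand string (if any)
    tokens = [parts[0].lower()]
    rest = parts[1] if len(parts) > 1 else ""
    for operand in rest.split(","):
        low = operand.strip().lower()
        if not low:
            continue
        if low in REGISTERS:
            tokens.append("REG")
        elif IMMEDIATE.match(low):
            tokens.append("IMM")
        elif "[" in low:
            tokens.append("MEM")
        else:
            tokens.append("SYM")  # labels, function names, anything else
    return tokens


def tokenize_listing(lines):
    """Concatenate the tokens of every line into one flat sequence."""
    tokens = []
    for line in lines:
        tokens.extend(tokenize_instruction(line))
    return tokens


listing = ["push ebp", "mov ebp, esp", "call sub_401000", "ret"]
print(tokenize_listing(listing))
# ['push', 'REG', 'mov', 'REG', 'REG', 'call', 'SYM', 'ret']
```

Replacing concrete operands with classes keeps the vocabulary small; whether the actual system normalizes operands this way is an assumption.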

The token sequence is then fed into a convolutional neural network (CNN). The CNN slides filters over the sequence to detect local code patterns, such as instruction combinations, frequent opcodes, repetitive structures, or possible signs of obfuscation. From these patterns, the network produces a dense embedding: a compact vector that summarizes the most relevant sequential information in the binary.
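As a rough illustration of how filters over a token sequence yield a fixed-size vector, the sketch below implements a single 1D convolution with ReLU and global max pooling in plain Python. The weights are random rather than trained, and the names make_params and cnn_embedding, the embedding size, and the filter width are all assumptions; a real model would typically be built with a framework such as Keras.

```python
import random


def make_params(vocab_size, embed_dim, num_filters, width, seed=0):
    """Create random (untrained) token embeddings and convolution filters."""
    rng = random.Random(seed)
    embed = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
             for _ in range(vocab_size)]
    filters = [[rng.uniform(-1, 1) for _ in range(width * embed_dim)]
               for _ in range(num_filters)]
    return embed, filters


def cnn_embedding(token_ids, embed, filters, width):
    """ReLU(1D convolution) followed by global max pooling.

    Returns one non-negative value per filter, so the output length
    does not depend on the length of the token sequence.
    """
    embed_dim = len(embed[0])
    vectors = [embed[i] for i in token_ids]
    while len(vectors) < width:  # zero-pad sequences shorter than the filter
        vectors.append([0.0] * embed_dim)
    pooled = []
    for weights in filters:
        best = 0.0  # starting at 0 makes the running max equal to max(ReLU(...))
        for start in range(len(vectors) - width + 1):
            window = [x for vec in vectors[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window))
            best = max(best, activation)
        pooled.append(best)
    return pooled


embed, filters = make_params(vocab_size=5, embed_dim=4, num_filters=3, width=2)
print(len(cnn_embedding([0, 1, 2, 1, 4], embed, filters, width=2)))  # 3
```

Each filter responds to one local pattern of adjacent tokens, and max pooling keeps only its strongest response anywhere in the sequence.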

In addition to this representation, the binary is also modeled as a graph, for example through a control flow graph. In this graph, nodes may represent basic blocks or functions, while edges indicate execution relationships, jumps, or calls between different parts of the program. The embedding generated by the CNN is integrated with this structural representation to enrich the available information before classification.
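The sketch below shows one possible way to represent such a graph and attach the CNN embedding to it. The text does not say how the two are integrated, so concatenating the same binary-level embedding onto every node's feature vector is purely an assumption, as are the function names build_cfg and enrich_nodes and the toy block names.

```python
def build_cfg(blocks):
    """Build an indexed graph from a mapping: block name -> successor names.

    Returns the node names (in insertion order) and a list of directed
    edges expressed as (source_index, destination_index) pairs.
    """
    nodes = list(blocks)
    index = {name: i for i, name in enumerate(nodes)}
    edges = [(index[src], index[dst]) for src in nodes for dst in blocks[src]]
    return nodes, edges


def enrich_nodes(node_features, cnn_vector):
    """Append the CNN embedding to each node's own feature vector (assumed scheme)."""
    return [list(features) + list(cnn_vector) for features in node_features]


# Toy control flow graph: a loop guarded by a check block.
blocks = {
    "entry": ["check"],
    "check": ["loop", "exit"],
    "loop": ["check"],
    "exit": [],
}
nodes, edges = build_cfg(blocks)
print(nodes)  # ['entry', 'check', 'loop', 'exit']
print(edges)  # [(0, 1), (1, 2), (1, 3), (2, 1)]

# Hypothetical per-block features (here: instruction count) plus a 2-value embedding.
features = enrich_nodes([[3.0], [2.0], [5.0], [1.0]], [0.5, 0.25])
print(features[0])  # [3.0, 0.5, 0.25]
```

An alternative under the same description would be to compute a separate CNN embedding per basic block and use it directly as that node's features.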

Finally, the enriched graph is processed by a graph neural network (GNN). The GNN applies message passing between nodes, combining the information of each node with that of its neighbors and capturing structural dependencies within the program. After several layers, a global representation of the binary is obtained and fed into a final classifier that determines whether the file is malicious or benign. The main contribution of this work is the combination of the local patterns learned by the CNN with the global structure captured by the GNN.
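Message passing, readout, and classification can be sketched in a few lines of plain Python. This is a generic mean-aggregation layer with a mean-pooling readout and a logistic output, chosen for illustration; the specific GNN variant, the readout, and the classifier used in the thesis are not stated, and all function names and weights here are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def message_passing_layer(features, edges, w_self, w_neigh):
    """One layer: h_v' = ReLU(W_self h_v + W_neigh mean of incoming neighbors)."""
    dim = len(features[0])
    incoming = [[] for _ in features]
    for src, dst in edges:
        incoming[dst].append(features[src])  # message flows src -> dst
    updated = []
    for node, own in enumerate(features):
        if incoming[node]:
            mean = [sum(col) / len(incoming[node]) for col in zip(*incoming[node])]
        else:
            mean = [0.0] * dim  # nodes without predecessors receive no message
        combined = [a + b for a, b in
                    zip(matvec(w_self, own), matvec(w_neigh, mean))]
        updated.append([max(0.0, x) for x in combined])
    return updated


def classify_graph(features, edges, layers, w_out, b_out):
    """Stack layers, mean-pool the nodes, and return P(malicious) via a sigmoid."""
    for w_self, w_neigh in layers:
        features = message_passing_layer(features, edges, w_self, w_neigh)
    graph_vec = [sum(col) / len(features) for col in zip(*features)]
    logit = sum(w * x for w, x in zip(w_out, graph_vec)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


identity = [[1.0, 0.0], [0.0, 1.0]]
node_features = [[1.0, 0.0], [0.0, 1.0]]
print(message_passing_layer(node_features, [(0, 1)], identity, identity))
# [[1.0, 0.0], [1.0, 1.0]]
```

With identity weights, node 1 ends up holding its own features plus those of its predecessor, which is the neighbor-combination effect described above; stacking more layers lets information travel over longer paths in the graph.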

BACHELOR’S THESIS BY

ÁLVARO GARCÍA PIQUERAS

Academic experience

  • Computer Science and Engineering Degree, Universidad Carlos III de Madrid (September 2022 – September 2026)

     

    Work experience

    • AI researcher – Universidad Carlos III de Madrid, in collaboration with Grupo MasOrange (October 2025 – June 2026)


    Skills

    • Programming languages: Python, Java, C++, JavaScript, Go, MATLAB

    • Libraries: Pandas, NumPy, TensorFlow, Keras, Scikit-Learn

    • Databases and cloud computing: AWS, Amazon S3, Oracle, BigQuery, PostgreSQL, PL/SQL

    • Version control and DevOps: GitHub, GitLab

    • AI, Machine Learning, Neural Networks, Embeddings, GNNs