Objective
The purpose of this Final Degree Project is to detect and classify anomalies in data describing a customer’s network quality. An anomaly is defined as the presence of a customer complaint, and a complaint arises from erratic behaviour in the network, so the two concepts are linked by a bidirectional cause-effect relationship. The final objective is to obtain two models: the first detects complaints produced by anomalies, and the second classifies the detected complaints by type.
First, an ETL process is carried out on all the relevant databases. The resulting data are clearly imbalanced, which is expected given the type of problem. A final selection of numerical and categorical attributes follows, which allows a base model to be developed as a starting point (1).
To measure and compare the quality of the detection models, three standard measures are used: the confusion matrix, the F1 score, and the area under the ROC curve, which is also plotted (2).
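As a minimal sketch of these three measures, the following pure-Python example computes them for a binary detector. The labels, scores and the 0.5 threshold are made-up illustration values, not project data, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = complaint."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    receives a higher score than a random negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative, imbalanced toy data: 6 normal records, 2 complaints.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(tp, fp, fn)
auc = roc_auc(y_true, scores)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} F1={f1:.3f} AUC={auc:.3f}")
```

With imbalanced data such as these, F1 and the area under the ROC curve are more informative than plain accuracy, because a model that always predicts the majority class would still score a high accuracy.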
Next comes the classification model, which starts from a smaller data set: only the anomalous data, that is, the complaints. Here too an imbalance is observed among the classes. The same attributes as in the base detection model are used. The resulting tree is tuned through its hyperparameters and pruned after training, which improves the results and, thanks to this simplification, makes the model explainable: its complete structure and the most important attributes for the classification can be inspected (3).
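The tuning and pruning step could be sketched with Scikit-Learn as follows. This is only an illustration under assumptions: the data are synthetic, the hyperparameter values are arbitrary, and cost-complexity pruning (`ccp_alpha`) stands in for the pruning method, which the text does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced three-class data standing in for the complaint types.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical hyperparameters; class_weight="balanced" compensates
# for the imbalance between classes.
params = dict(max_depth=8, min_samples_leaf=5,
              class_weight="balanced", random_state=0)

unpruned = DecisionTreeClassifier(**params).fit(X_train, y_train)
# Same tree, but simplified by cost-complexity pruning after it is grown.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, **params).fit(X_train, y_train)

print("nodes before pruning:", unpruned.tree_.node_count)
print("nodes after pruning: ", pruned.tree_.node_count)
print("test accuracy (pruned):", pruned.score(X_test, y_test))

# Attributes ranked by their importance in the pruned tree.
ranking = sorted(enumerate(pruned.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for index, importance in ranking:
    print(f"feature {index}: {importance:.3f}")
```

The smaller tree is the one that can be read in full, and its `feature_importances_` give the ranking of attributes mentioned above.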
Two models are developed on the same architecture, but with completely different functions, interpretations and results. In both, the aim is to equalise the hit rate of the two classes by finding the cut-off point at which the true positive rate (TPR) and the true negative rate (TNR) meet. The first is the Autoencoder, whose latent space holds the two-dimensional encoding that the model generates from the input data; it obtains better results than the initial base model. The second is the Variational Autoencoder, whose latent space represents the probability distribution of the data belonging to each class; it improves the results further. Finally, for this last (and best) model, modifying the threshold that separates one class from the other makes the model specialise in one class, which again leads to improved results in each of the predictions.
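The search for the cut-off point can be sketched as below. The sketch assumes that each record receives a single anomaly score (for instance an autoencoder's reconstruction error, which is an assumption here) and that higher scores mean "complaint". The data and function names are illustrative only.

```python
def rates_at(y_true, scores, threshold):
    """Return (TPR, TNR) when records with score >= threshold are flagged."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if label == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(y_true, scores):
    """Pick the observed score at which TPR and TNR are closest."""
    def gap(threshold):
        tpr, tnr = rates_at(y_true, scores, threshold)
        return abs(tpr - tnr)
    return min(sorted(set(scores)), key=gap)


# Toy data: 8 normal records (0) and 2 complaints (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.50, 0.62, 0.33, 0.70]

best = balanced_threshold(y_true, scores)
tpr, tnr = rates_at(y_true, scores, best)
print(f"threshold={best} TPR={tpr} TNR={tnr}")
```

Moving the threshold away from this balanced point trades one rate for the other, which is the class specialisation described for the final model.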
FINAL DEGREE PROJECT OF:
JAVIER CRUZ DEL VALLE
Degree
Master’s Degree in Computer Science and Technology
Carlos III University of Madrid (September 2022 – present)
Degree in Computer Engineering
Carlos III University of Madrid (September 2017 – July 2022)
Work Experience
Technical specialist in Artificial Intelligence, part-time. Universidad Carlos III de Madrid in collaboration with Xfera Móviles for the MásMóvil group chair (September 2021 – May 2022).
Software integration IT architect, internship contract. NTT DATA (November 2020 – July 2021)
Technical skills
Programming languages: Python, Java, R and C++
Domain specific languages: SQL and PDDL
Development libraries: Pandas, NumPy, TensorFlow, Keras, Plotly and Scikit-Learn
Other tools: Git, Google tools (Colaboratory and BigQuery)