New Heuristics Method for Malicious URLs Detection Using Machine Learning

Malicious URLs are a very prominent, dangerous form of cyber threats in view of the fact that they can enable many evils like phishing attacks, malware distribution, and several other kinds of cyber fraud. The techniques of detection conventionally applied are based on blacklisting and heuristic an...

Full description

Saved in:
Bibliographic Details
Main Author: Maher Kassem Hasan
Format: Article
Language:English
Published: College of Computer and Information Technology – University of Wasit, Iraq 2024-09-01
Series:Wasit Journal of Computer and Mathematics Science
Subjects:
Online Access:http://wjcm.uowasit.edu.iq/index.php/wjcm/article/view/267
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Malicious URLs are a very prominent, dangerous form of cyber threats in view of the fact that they can enable many evils like phishing attacks, malware distribution, and several other kinds of cyber fraud. The techniques of detection conventionally applied are based on blacklisting and heuristic analyses, which are gradually becoming inefficient against sophisticated, rapidly evolving threats. In this paper, the authors present various machine learning techniques applied in malicious URL detection. In the present paper, we will look at three machine learning models: Logistic Regression, Random Forest, and Support Vector Machines. We used a methodology that involved collecting data and feature extraction, training a model, then evaluating its performance with different metrics such as accuracy, precision, recall, and F1-score. We implemented and optimized three models—Logistic Regression, Random Forest, and Support Vector Machines (SVM)—based on the literature available that indicates the effectiveness of these models. Logistic Regression shows promising results to detect the malicious URLs, according to Vanitha and Vinodhini. Random Forest models are found to be very robust and accurate according to Cui et al. and Vanhoenshoven et al., SVM models are evidenced to have very high accuracy according to Manjeri et al., Further works on deep learning models emphasized their potentials. In our study, the optimized Random Forest model in our case showed the best performance, and its training accuracy was 99%, while validation accuracy was 90.5%, also logistic Regression and SVM achieved training accuracy was 89.31%, while validation accuracy was 90.5%. All the optimization processes, model performances, and integration into the real-time cybersecurity infrastructures, along with the strengths and limitations, are discussed in this paper. The paper will, therefore, discuss the benefits and challenges for each model in this aspect—emphasizing continuous updating of the models and integrating them into real-time cybersecurity infrastructures.  
ISSN:2788-5879
2788-5887