Enhancing Cloud Job Failure Prediction With a Novel Multilayer Voting-Based Framework
In modern cloud data centers, accurately predicting job failures before they occur is essential for ensuring system reliability, availability, and efficiency. To address this challenge, researchers have progressively developed machine learning and deep learning techniques that examine cloud logs to...
Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: |
IEEE
2025-01-01
|
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11104242/ |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | In modern cloud data centers, accurately predicting job failures before they occur is essential for ensuring system reliability, availability, and efficiency. To address this challenge, researchers have progressively developed machine learning and deep learning techniques that examine cloud logs to identify patterns linked to such failures. To surpass the prediction accuracy of earlier models, this paper presents the Multilayer Multi-Prediction Framework (MMPF), a novel approach specifically designed to enhance failure prediction accuracy. The framework is divided into two distinct layers. The first layer aims to detect job failures using a fine-tuned voting mechanism, while the second layer identifies the specific failure type. In the first layer, several classifiers are integrated to build a robust predictive model, which outperforms conventional methods that rely on a single classifier. By aggregating the predictions from multiple base classifiers and making the final decision based on a weighted average of the predicted probabilities, this approach delivers significantly higher accuracy in failure prediction. Decision Trees, K-Nearest Neighbors, Extreme Gradient Boosting, Adaptive Boosting, and Artificial Neural Networks are the classifiers implemented in this layer. Once the failed jobs are detected, the second layer takes over to determine the failure type employing the Random Forest algorithm. Based on experiments conducted with the Google Cluster 2019 trace dataset, the framework achieved impressive accuracy: 99.83% in detecting failed jobs and 99.97% in identifying the nature of the failure. |
|---|---|
| ISSN: | 2169-3536 |