Unstructured Big Data Threat Intelligence Parallel Mining Algorithm
To efficiently mine threat intelligence from the vast array of open-source cybersecurity analysis reports on the web, we have developed the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. Initially, open-source cybersecurity analysis reports are collected and converted into...
Main Authors: | Zhihua Li, Xinye Yu, Tao Wei, Junhao Qian |
---|---|
Format: | Article |
Language: | English |
Published: | Tsinghua University Press, 2024-06-01 |
Series: | Big Data Mining and Analytics |
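Subjects: | unstructured big data mining; parallel deep forest; multi-label classification algorithm; threat intelligence |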
Online Access: | https://www.sciopen.com/article/10.26599/BDMA.2023.9020032 |
_version_ | 1832544925849223168 |
---|---|
author | Zhihua Li, Xinye Yu, Tao Wei, Junhao Qian |
author_sort | Zhihua Li |
collection | DOAJ |
description | To efficiently mine threat intelligence from the vast array of open-source cybersecurity analysis reports on the web, we have developed the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. Initially, open-source cybersecurity analysis reports are collected and converted into a standardized text format. Subsequently, five tactics category labels are annotated, creating a multi-label dataset for tactics classification. Addressing the limitations of low execution efficiency and scalability in the sequential deep forest algorithm, our PDFMLC algorithm employs broadcast variables and the Lempel-Ziv-Welch (LZW) algorithm, significantly enhancing its acceleration ratio. Furthermore, our proposed PDFMLC algorithm incorporates label mutual information from the established dataset as input features. This captures latent label associations, significantly improving classification accuracy. Finally, we present the PDFMLC-based Threat Intelligence Mining (PDFMLC-TIM) method. Experimental results demonstrate that the PDFMLC algorithm exhibits exceptional node scalability and execution efficiency. Simultaneously, the PDFMLC-TIM method proficiently conducts text classification on cybersecurity analysis reports, extracting tactics entities to construct comprehensive threat intelligence. As a result, successfully formatted STIX2.1 threat intelligence is established. |
format | Article |
id | doaj-art-ebd1e2149746426cad7fb6a1ec547934 |
institution | Kabale University |
issn | 2096-0654 |
language | English |
publishDate | 2024-06-01 |
publisher | Tsinghua University Press |
record_format | Article |
series | Big Data Mining and Analytics |
spelling | doaj-art-ebd1e2149746426cad7fb6a1ec547934; 2025-02-03T09:08:16Z; eng; Tsinghua University Press; Big Data Mining and Analytics; ISSN 2096-0654; 2024-06-01; vol. 7, no. 2, pp. 531-546; DOI 10.26599/BDMA.2023.9020032; Unstructured Big Data Threat Intelligence Parallel Mining Algorithm; Zhihua Li, Xinye Yu, Tao Wei (School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China); Junhao Qian (School of IoT Engineering, Jiangnan University, Wuxi 214122, China); https://www.sciopen.com/article/10.26599/BDMA.2023.9020032; keywords: unstructured big data mining; parallel deep forest; multi-label classification algorithm; threat intelligence |
title | Unstructured Big Data Threat Intelligence Parallel Mining Algorithm |
topic | unstructured big data mining; parallel deep forest; multi-label classification algorithm; threat intelligence |
url | https://www.sciopen.com/article/10.26599/BDMA.2023.9020032 |
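
The description states that PDFMLC uses label mutual information from the annotated dataset as input features to capture latent label associations. The record does not say how that quantity is computed, so the following is only a minimal sketch, assuming binary label columns and plain pairwise mutual information (in nats). The function names `label_mutual_information` and `mutual_information_matrix` are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def label_mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two label columns.

    Assumption: each column is a sequence of discrete values (e.g. 0/1),
    one entry per report. This is an illustration, not the article's method.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label columns must be non-empty and of equal length")

    joint = Counter(zip(labels_a, labels_b))  # counts of (a, b) pairs
    count_a = Counter(labels_a)               # marginal counts for column a
    count_b = Counter(labels_b)               # marginal counts for column b

    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        ratio = (c * n) / (count_a[a] * count_b[b])
        mi += p_ab * math.log(ratio)
    return mi


def mutual_information_matrix(label_rows):
    """Pairwise mutual information for every pair of label columns.

    label_rows: one list of label values per report (rows = reports,
    columns = labels).
    """
    columns = list(zip(*label_rows))
    return [
        [label_mutual_information(ci, cj) for cj in columns]
        for ci in columns
    ]


if __name__ == "__main__":
    # Toy data: 4 reports, 3 hypothetical binary labels.
    rows = [
        [0, 0, 0],
        [0, 0, 1],
        [1, 1, 0],
        [1, 1, 1],
    ]
    for row in mutual_information_matrix(rows):
        print([round(v, 4) for v in row])
```

In this toy data the first two label columns are identical, so their mutual information equals the entropy of a balanced binary label (ln 2), while the third column is independent of both and scores zero. How such values are turned into input features for the forest is not described in this record.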