Unstructured Big Data Threat Intelligence Parallel Mining Algorithm

Bibliographic Details
Main Authors: Zhihua Li, Xinye Yu, Tao Wei, Junhao Qian
Format: Article
Language: English
Published: Tsinghua University Press 2024-06-01
Series: Big Data Mining and Analytics
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2023.9020032
Collection: DOAJ
Description: To efficiently mine threat intelligence from the large body of open-source cybersecurity analysis reports on the web, we developed the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. First, open-source cybersecurity analysis reports are collected and converted into a standardized text format. Next, five tactics category labels are annotated, creating a multi-label dataset for tactics classification. To address the low execution efficiency and limited scalability of the sequential deep forest algorithm, PDFMLC employs broadcast variables and the Lempel-Ziv-Welch (LZW) algorithm, significantly improving its acceleration ratio. PDFMLC also incorporates label mutual information from the established dataset as input features, which captures latent label associations and significantly improves classification accuracy. Finally, we present the PDFMLC-based Threat Intelligence Mining (PDFMLC-TIM) method. Experimental results show that the PDFMLC algorithm exhibits exceptional node scalability and execution efficiency. In addition, the PDFMLC-TIM method classifies the text of cybersecurity analysis reports and extracts tactics entities to construct comprehensive threat intelligence, which is then formatted as STIX 2.1.
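The description mentions label mutual information as a way to capture associations between tactics labels. The record does not say how the article computes it or feeds it into the deep forest, so the following is only a minimal sketch of the standard pairwise mutual information between binary label columns. The function name `label_mutual_information` and the toy label matrix are assumptions made for illustration, not taken from the article.

```python
import math
from itertools import combinations


def label_mutual_information(Y):
    """Return pairwise mutual information (in nats) between label columns.

    Y is a list of rows, one per report; each row holds one 0/1 value
    per label. This is the textbook definition, not the article's code.
    """
    n = len(Y)
    n_labels = len(Y[0])
    mi = {}
    for i, j in combinations(range(n_labels), 2):
        # Count joint occurrences of (label i value, label j value).
        joint = {}
        for row in Y:
            key = (row[i], row[j])
            joint[key] = joint.get(key, 0) + 1
        total = 0.0
        for (a, b), count in joint.items():
            p_ab = count / n
            p_a = sum(1 for row in Y if row[i] == a) / n
            p_b = sum(1 for row in Y if row[j] == b) / n
            total += p_ab * math.log(p_ab / (p_a * p_b))
        mi[(i, j)] = total
    return mi


# Hypothetical toy data: labels 0 and 1 always co-occur, label 2 is independent.
toy_labels = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
pairwise_mi = label_mutual_information(toy_labels)
```

In this toy matrix the pair (0, 1) gets a high score because the two labels always agree, while pairs involving label 2 score zero. Scores of this kind could be appended to each sample's feature vector, though whether the article does it that way is not stated in this record.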
Institution: Kabale University
ISSN: 2096-0654
Volume: 7
Issue: 2
Pages: 531-546
DOI: 10.26599/BDMA.2023.9020032
Author affiliations:
Zhihua Li, Xinye Yu, Tao Wei: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
Junhao Qian: School of IoT Engineering, Jiangnan University, Wuxi 214122, China
Subjects: unstructured big data mining; parallel deep forest; multi-label classification algorithm; threat intelligence