MSSA: multi-stage semantic-aware neural network for binary code similarity detection

Binary code similarity detection (BCSD) aims to identify whether a pair of binary code snippets is similar, and it is widely used for tasks such as malware analysis, patch analysis, and clone detection. Current state-of-the-art approaches are based on the Transformer architecture, which requires substantial computatio...

Bibliographic Details
Main Authors: Bangrui Wan, Jianjun Zhou, Ying Wang, Feng Chen, Ying Qian
Format: Article
Language: English
Published: PeerJ Inc. 2025-01-01
Series: PeerJ Computer Science
Subjects: Binary analysis; Similarity detection; Neural network
Online Access: https://peerj.com/articles/cs-2504.pdf
_version_ 1832594288981049344
author Bangrui Wan
Jianjun Zhou
Ying Wang
Feng Chen
Ying Qian
author_sort Bangrui Wan
collection DOAJ
description Binary code similarity detection (BCSD) aims to identify whether a pair of binary code snippets is similar, and it is widely used for tasks such as malware analysis, patch analysis, and clone detection. Current state-of-the-art approaches are based on the Transformer architecture, which requires substantial computational resources. Learning-based approaches still leave room for improvement in learning the deeper semantics of binary code. In this paper, we propose MSSA, a multi-stage semantic-aware neural network for BCSD at the function level. It integrates the semantic and structural information of assembly instructions within basic blocks, between basic blocks, and across the entire function through four semantic-aware neural networks, achieving a deep understanding of binary code semantics. MSSA is a lightweight model with only 0.38M parameters in its backbone network, making it suitable for deployment in CPU environments. Experimental results show that MSSA outperforms Gemini, Asm2Vec, SAFE, and jTrans in classification performance and ranks second only to the Transformer-based jTrans in retrieval performance.
format Article
id doaj-art-715cba494c79410eb2654a6a638a2ab3
institution Kabale University
issn 2376-5992
language English
publishDate 2025-01-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj-art-715cba494c79410eb2654a6a638a2ab3; 2025-01-19T15:05:10Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2025-01-01; 11; e2504; 10.7717/peerj-cs.2504; MSSA: multi-stage semantic-aware neural network for binary code similarity detection; Bangrui Wan (0), Jianjun Zhou (1), Ying Wang (2), Feng Chen (3), Ying Qian (4); affiliation 0-4: School of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China; https://peerj.com/articles/cs-2504.pdf; Binary analysis; Similarity detection; Neural network
title MSSA: multi-stage semantic-aware neural network for binary code similarity detection
topic Binary analysis
Similarity detection
Neural network
url https://peerj.com/articles/cs-2504.pdf