LLM4TDG: test-driven generation of large language models based on enhanced constraint reasoning
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-05-01 |
| Series: | Cybersecurity |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s42400-024-00335-4 |
| Summary: | Abstract With the evolution of modern software development paradigms, component reuse and low-code approaches have emerged as mainstream in software development. However, developers often lack an in-depth understanding of reused code. The inability of components to operate autonomously leads to insufficient testing of software functionality and security, further exacerbating the contradiction between the increasing complexity of software architectures and the demand for accurate and efficient automated software testing. This, in turn, increases the frequency of software supply chain security incidents. This paper proposes LLM4TDG, a test-driven generation framework based on large language models (LLMs). By formally defining a constraint dependency graph and converting it into context constraints, the framework enhances LLMs’ ability to understand natural-language descriptions such as test requirements and documentation. Constraint reasoning and backtracking mechanisms are then used to automatically generate test drivers that satisfy the defined constraints. Using the EvalPlus dataset, we evaluate the overall test case generation capability of LLM4TDG with four general-domain LLMs and five code-generation LLMs. The experimental results indicate that our approach significantly enhances LLMs’ ability to comprehend constraints in testing objectives, achieving a 47.62% increase in constraint understanding across 147 testing tasks. LLM4TDG improves the average pass@k metric across all LLMs by 10.41%, and the pass@k of CodeQwen-chat improves by up to 18.66%, surpassing the state-of-the-art GPT-4 with 92.16% on HUMANEVAL and 87.14% on HUMANEVAL+, which enhances error correction and functional correctness in test-driven code generation. Meanwhile, experiments on a dataset of Python third-party libraries containing malicious behavior, conducted in the context of security testing tasks, validate the effectiveness of our method in real-world applications and its generalization capability. |
|---|---|
| ISSN: | 2523-3246 |
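
The abstract above reports gains in the pass@k metric on HUMANEVAL and HUMANEVAL+. For background, the sketch below shows the standard unbiased pass@k estimator used by the HumanEval/EvalPlus benchmarks (it is not code from the LLM4TDG paper), where n candidates are sampled per task and c of them pass all tests; the sample counts n=200 and c=150 in the example are illustrative assumptions.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    candidates, drawn without replacement from n generated candidates of
    which c are correct, passes all tests."""
    if n - c < k:
        # Fewer than k incorrect candidates exist, so any k-subset
        #必 contains at least one correct candidate.
        return 1.0
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative example: 200 generated test drivers per task, 150 pass.
print(round(pass_at_k(n=200, c=150, k=1), 4))   # 0.75
print(round(pass_at_k(n=200, c=150, k=10), 4))  # ≈ 1.0
```

Benchmark scores such as the 92.16% on HUMANEVAL cited in the summary are obtained by averaging this per-task estimate over all tasks in the dataset.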