Cross-Community Question Relevance Prediction for Stack Overflow and GitHub

Bibliographic Details
Main Authors: Song Yu, Bugao Jiang, Danni Zhang, Zhifang Liao
Format: Article
Language: English
Published: Graz University of Technology 2025-01-01
Series: Journal of Universal Computer Science
Online Access: https://lib.jucs.org/article/119772/download/pdf/
Description
Summary: As the open-source community has evolved, Stack Overflow (SO) has come into widespread use. The question-and-answer community’s mechanism for recommending related questions helps users discover more content relevant to their current problems, expediting issue resolution. However, recommending related questions within a single community limits both the amount and the diversity of available content, and the recommendation results rely heavily on that community’s existing knowledge. Stack Overflow still harbors a substantial number of unresolved questions. To address this situation, this paper proposes a cross-community question relevance prediction model, CCQRP, which predicts the relevance between Stack Overflow questions and GitHub (GH) issues and recommends relevant GitHub issues. CCQRP aims to help developers resolve problems effectively and improve development efficiency. We design an embedding layer incorporating BERTOverflow and Bi-LSTM and devise a weighted attention matrix based on the named entity types of tokens. This matrix assigns different weights to tokens of different named entity types during prediction, capturing the information most critical to predicting the relevance of SO questions and GH issues. Because no suitable dataset exists, we construct the Question-Issue dataset (QI), consisting of Stack Overflow questions, GitHub issues, and the corresponding question-issue relevance labels; it contains 240,000 related SO question-GH issue pairs and 470,000 unrelated pairs. We evaluate the effectiveness of CCQRP on QI. Compared with the latest models (MQDD, CodeBERT, ASIM), CCQRP improves the F1-score by 0.60% to 10.86% and exhibits robust generalization capabilities.
ISSN: 0948-6968
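
Illustrative note: the Summary mentions an attention matrix that weights tokens by their named entity type, but this record does not give the formulation. The following is only a minimal sketch of one way such weighting could work, not the authors' CCQRP implementation. The entity type labels, the weight values, the function names, and the choice to add the log of the weight to each dot-product score before the softmax are all assumptions made for illustration.

```python
import math

# Hypothetical entity-type weights; the record does not list the real types or values.
TYPE_WEIGHTS = {"API": 2.0, "LIBRARY": 1.5, "O": 1.0}


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_attention(question_vecs, issue_vecs, issue_types, weights=TYPE_WEIGHTS):
    """Attention from each question token to each issue token.

    Assumed scheme: the dot-product score for an issue token is shifted by
    log(weight of its entity type), so after the softmax its attention is
    proportional to exp(score) * weight.
    """
    biases = [math.log(weights.get(t, 1.0)) for t in issue_types]
    matrix = []
    for q in question_vecs:
        scores = [
            sum(a * b for a, b in zip(q, k)) + bias
            for k, bias in zip(issue_vecs, biases)
        ]
        matrix.append(softmax(scores))
    return matrix


# Toy example: one question token, two identical issue-token vectors
# that differ only in their (hypothetical) entity type.
attention = weighted_attention(
    question_vecs=[[1.0, 0.0]],
    issue_vecs=[[1.0, 0.0], [1.0, 0.0]],
    issue_types=["API", "O"],
)
print(attention)
```

In this toy case both issue tokens have the same raw score, so the token tagged "API" (weight 2.0) receives two thirds of the attention and the untagged token one third. In a real model the vectors would come from the embedding layer rather than being written by hand.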