It is all in the [MASK]: Simple instruction-tuning enables BERT-like masked language models as generative classifiers
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-06-01 |
| Series: | Natural Language Processing Journal |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2949719125000263 |
| Summary: | While encoder-only models such as BERT and ModernBERT are ubiquitous in real-world NLP applications, their conventional reliance on task-specific classification heads can limit their applicability compared to decoder-based large language models (LLMs). In this work, we introduce ModernBERT-Large-Instruct, a 0.4B-parameter encoder model that leverages its masked language modeling (MLM) head for generative classification. We design a simple approach, extracting all single-token answers from the FLAN dataset collection and re-purposing the standard MLM pre-training objective to mask only this single answer token. Our approach employs an intentionally simple training loop and inference mechanism that requires no heavy pre-processing, heavily engineered prompting, or architectural modifications. ModernBERT-Large-Instruct exhibits strong zero-shot performance on both classification and knowledge-based tasks, outperforming similarly sized LLMs on MMLU and achieving 93% of Llama3-1B’s MMLU performance with 60% fewer parameters. We also demonstrate that, when fine-tuned, the generative approach using the MLM head matches or even surpasses traditional classification-head methods across diverse NLU tasks. This capability emerges specifically in models trained on contemporary, diverse data mixes, with models trained on lower-volume, less-diverse data yielding considerably weaker performance. Although preliminary, these results demonstrate the potential of using the original generative masked language modeling head over traditional task-specific heads for downstream tasks. Our work suggests that further exploration into this area is warranted, highlighting many avenues for future improvements. |
| ISSN: | 2949-7191 |
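The abstract describes classifying by masking a single answer token and letting the MLM head fill it in. Below is a minimal sketch of that general idea, not the paper's exact recipe: the prompt template is illustrative, and the base checkpoint name is an assumed stand-in for the instruction-tuned model the authors describe.

```python
# Hedged sketch: zero-shot classification with an encoder's MLM head, in the spirit
# of the single-token-answer setup described in the abstract. The template and the
# checkpoint name are assumptions, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed stand-in checkpoint; the paper's instruction-tuned model would be swapped in here.
model_id = "answerdotai/ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

def classify(text: str, labels: list[str]) -> str:
    """Score candidate single-token labels at the [MASK] position and return the best one."""
    prompt = f"{text} Sentiment: {tokenizer.mask_token}"  # illustrative template only
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Locate the [MASK] position in the input sequence.
    mask_idx = int((inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0])

    # Each label must map to a single vocabulary token, mirroring the single-token
    # answer constraint described in the abstract.
    label_ids = [tokenizer.encode(" " + lab, add_special_tokens=False) for lab in labels]
    assert all(len(ids) == 1 for ids in label_ids), "labels must be single tokens"

    scores = logits[0, mask_idx, torch.tensor([ids[0] for ids in label_ids])]
    return labels[int(scores.argmax())]

print(classify("The movie was an absolute delight.", ["positive", "negative"]))
```

The point of the sketch is that no task-specific head is trained or attached: the same MLM head used during pre-training produces the label by comparing its logits over candidate answer tokens at the masked position.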