Data Generation for Neural Disinformation Detection
Abstract: Incorporating large language models into domain-specific NLP tasks has become prevalent due to the easy availability of pre-trained model checkpoints. However, fine-tuning these pre-trained models is necessary to improve performance on such tasks. Neural fake news detection is one such task, in which a large language model must detect machine-generated fake news. Fine-tuning for this task is challenging because it requires collecting real news articles and generating neural fake news counterparts. In this paper, we therefore explore the characteristics of the data generation process that underlies fine-tuning large language models for neural fake news detection, and we present experiments that develop a deeper understanding of its fundamental properties. Several of our findings can guide future research on neural fake news detection and help determine the quantity and variability of data required for fine-tuning large language models.