ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments

1South China University of Technology, China
2The Hong Kong Polytechnic University, China
ACM MM 2025


Overview. We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines. Each sample includes procedural descriptions along with potential scientific principles and safety guidelines. We present ExpStar, an automatic experiment commentary generation MLLM that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge.

Abstract

Experiment commentary is crucial in describing the experimental procedures, delving into underlying scientific principles, and incorporating content-related safety guidelines. In practice, human teachers rely heavily on subject-specific expertise and invest significant time preparing such commentary. To address this challenge, we introduce the task of automatic commentary generation across multi-discipline scientific experiments. While recent progress in large multimodal models (LMMs) has demonstrated promising capabilities in video understanding and reasoning, their ability to generate fine-grained and insightful experiment commentary remains largely underexplored. In this paper, we make the following contributions: (i) Dataset Construction: We construct ExpInstruct, the first dataset tailored for experiment commentary generation, featuring over 7K step-level commentaries across 21 scientific subjects from 3 core disciplines. Each sample includes procedural descriptions along with potential scientific principles and safety guidelines. (ii) Novel Model: We propose ExpStar, an automatic experiment commentary generation model that leverages a retrieval-augmented mechanism to adaptively access, evaluate, and utilize external knowledge. (iii) Promising Results: Extensive experiments show that ExpStar substantially outperforms 14 leading LMMs, which highlights the superiority of our dataset and model.

ExpInstruct Dataset

Dataset statistics 1

Dataset statistics 2
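To make the sample structure concrete, the following is a minimal sketch of how one step-level commentary record might be represented. The class name `CommentarySample`, its field names, and the `commentary()` helper are hypothetical and are not taken from the released dataset. Treating the scientific principle and safety guideline as optional is an assumption based on the description of them as "potential" components.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommentarySample:
    """Hypothetical record for one step-level commentary (not the official schema)."""

    discipline: str                   # one of the 3 core disciplines
    subject: str                      # one of the 21 scientific subjects
    procedure: str                    # procedural description of the step
    principle: Optional[str] = None   # underlying scientific principle, if any
    safety: Optional[str] = None      # safety guideline, if any

    def commentary(self) -> str:
        """Join the components that are present into one commentary string."""
        parts = [self.procedure, self.principle, self.safety]
        return " ".join(p for p in parts if p)
```

A step that carries only a procedure and a safety note would leave `principle` unset, and `commentary()` would simply skip it.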

Method

Overall Structure

Overview of Our Proposed ExpStar. It is built on the Qwen2.5-VL-7B architecture. Special control tokens are employed to guide the model’s behavior: <RET> and <NOT RET> determine whether retrieval is necessary, while <REL> and <NOT REL> evaluate the relevance of retrieved passages. The right three LMMs share parameters.
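The role of the control tokens can be illustrated with a short control-flow sketch. This is not the authors' implementation: the function `generate_commentary` and the four callables passed to it are hypothetical stand-ins for model and retriever calls, and the ordering of steps (decide, retrieve, filter, generate) is an assumption inferred from the token descriptions above. Only the four token strings come from the text.

```python
from typing import Callable, List

# Control tokens named in the overview.
RET, NOT_RET = "<RET>", "<NOT RET>"
REL, NOT_REL = "<REL>", "<NOT REL>"


def generate_commentary(
    clip: object,
    decide_retrieval: Callable[[object], str],        # returns RET or NOT_RET
    retrieve: Callable[[object], List[str]],          # returns candidate passages
    judge_relevance: Callable[[object, str], str],    # returns REL or NOT_REL
    write_commentary: Callable[[object, List[str]], str],
) -> str:
    """Hypothetical retrieval-augmented generation loop gated by control tokens."""
    # Step 1: the model decides whether external knowledge is needed.
    if decide_retrieval(clip) != RET:
        return write_commentary(clip, [])

    # Step 2: fetch candidates and keep only passages judged relevant.
    passages = retrieve(clip)
    relevant = [p for p in passages if judge_relevance(clip, p) == REL]

    # Step 3: generate commentary conditioned on the surviving passages.
    return write_commentary(clip, relevant)
```

In this sketch the decision, relevance, and generation callables are separate arguments only for readability; since the overview states that the LMMs share parameters, they could all be backed by the same model.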

Experiments


Qualitative Results


Qualitative Results of Qwen2.5-VL-7B, GPT-4o, and our ExpStar. Red text denotes inappropriate generations.


More Qualitative Results. We provide additional qualitative visualizations of commentary generation across multi-discipline scientific experiments.