
M-STAR

Diving into Self-Evolving Training for Multimodal Reasoning

1 The Hong Kong University of Science and Technology, 2 Shanghai Jiao Tong University,
3 Helixon Research, 4 The Chinese University of Hong Kong
* Equal contributions

Introduction

M-STAR (short for Multimodal Self-evolving TrAining for Reasoning) is a project aimed at improving multimodal reasoning via self-evolving training.

In M-STAR, we aim to answer the following questions:

  • Can we enhance Multimodal Reasoning through Self-Evolving Training?
  • How can we comprehensively understand each factor of self-evolving training and design an optimal recipe for multimodal reasoning?
  • What insights can we gain from the training Dynamics of Self-Evolution, and how can these insights inform better self-evolving training for multimodal reasoning?
We also release the corresponding code and resources to facilitate future research:
  • M-STAR Framework: A framework for Self-Evolving Training of Large Multimodal Models (LMMs) including Generation, Training, and Rewarding.
  • M-STAR Resources: M-STAR Models, CoT Dataset, and Multimodal Process Reward Model (MPRM) Training Dataset.

Overview

[Main figure: M-STAR overview]

Exploration & Exploitation

Self-Evolving Training for Multimodal Reasoning can be viewed as a loop of exploration and exploitation: the model generates responses to various question-image pairs (exploration) and learns to maximize the expected reward over those responses (exploitation). Through this self-evolution cycle, the model continually refines its performance. This approach improves the model's reasoning abilities without relying on extensive human-annotated data, making it more efficient and scalable for multimodal reasoning.
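As a concrete illustration, here is a minimal sketch of this exploration-exploitation loop. The policy interface (sample/train) and the binary answer-matching reward are assumptions for illustration, not the actual M-STAR implementation.

```python
# Minimal sketch of the exploration-exploitation loop (assumed interfaces,
# not the released M-STAR code).

def answer_reward(response: str, answer: str) -> float:
    """Binary reward: 1.0 if the response ends with the ground-truth answer."""
    return 1.0 if response.strip().endswith(answer.strip()) else 0.0

def self_evolve(policy, dataset, num_iterations: int, num_samples: int = 8):
    for _ in range(num_iterations):
        training_pool = []
        for question, image, answer in dataset:
            # Exploration: sample candidate chain-of-thought responses.
            candidates = policy.sample(question, image, n=num_samples)
            # Exploitation: keep only responses that earn reward.
            kept = [c for c in candidates if answer_reward(c, answer) > 0]
            training_pool.extend((question, image, c) for c in kept)
        # Update the policy on its own filtered generations.
        policy.train(training_pool)
    return policy
```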

Self-Evolving Design Components

To comprehensively understand this loop, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning.

Self-Evolution Dynamics

Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. Based on these investigations, we present a final recipe that accounts for both exploration and exploitation in self-evolving training for multimodal reasoning, encapsulating these design choices in the M-STAR framework. M-STAR achieves 59.5% accuracy on MathVista, surpassing the pre-evolved model by 6.9% absolute without using additional human annotations.

Diving into Self-Evolving Design Components

Warm-Up Phase


In the realm of multimodal training, Chain-of-Thought (CoT) data is notably scarce, which poses challenges in the development of models capable of generating intermediate reasoning steps. The Warm-Up Phase in our project represents the initial step before self-evolving training, designed to establish a foundational policy model. During this phase, the model is prompted to generate reasoning steps for each input triplet consisting of a question, an image, and an answer. By filtering the responses based on answer accuracy, we conduct warmup training for the policy model, enabling it to begin generating coherent CoT responses. This preparatory phase is crucial in equipping the policy model with the capability to produce intermediate reasoning steps, forming a stepping stone toward more advanced, self-evolving multimodal training.
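A rough sketch of this warm-up data construction is shown below; the sample_cot interface and the "Answer:"-style final-answer format are illustrative assumptions, not the released code.

```python
# Sketch of warm-up data construction (hypothetical helpers, not the released code).

def extract_final_answer(cot: str) -> str:
    """Naive final-answer extraction: the text after the last 'Answer:' marker."""
    return cot.rsplit("Answer:", 1)[-1].strip()

def build_warmup_set(model, triplets, n_samples: int = 4):
    """Keep CoT rationales whose extracted final answer matches the ground truth."""
    warmup_data = []
    for question, image, answer in triplets:
        for cot in model.sample_cot(question, image, n=n_samples):
            if extract_final_answer(cot) == answer:
                warmup_data.append({"question": question, "image": image, "response": cot})
                break  # one correct rationale per triplet is enough for this sketch
    return warmup_data
```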

Training Method


Key findings:

  • Optimizing the model from the last checkpoint is superior to retraining from scratch every time.
  • Inheriting the optimizer states from the previous iteration also leads to performance improvements.
  • Each iteration should traverse an appropriately sized interval of queries from the training set, neither too large nor too small.
  • Together, these choices lead to a more online form of self-evolving training, as sketched below.
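The sketch below illustrates what such a more online schedule might look like. The explore_fn/train_fn callables stand in for the generation and training stages and are assumptions for illustration; the interval size and the persistence of optimizer states are the knobs discussed above.

```python
# Sketch of a more online evolution schedule (assumed interfaces).
# The policy and optimizer persist across iterations, and each iteration
# traverses only a moderate slice (interval) of the training queries.

def online_self_evolve(policy, optimizer, queries, explore_fn, train_fn,
                       num_iterations: int, interval: int):
    for it in range(num_iterations):
        # Traverse the next interval of queries, wrapping around the training set.
        start = (it * interval) % len(queries)
        chunk = queries[start:start + interval]
        data = explore_fn(policy, chunk)  # sample and filter responses
        # No re-initialization: weights and optimizer states carry over from the
        # previous iteration instead of retraining from scratch.
        train_fn(policy, optimizer, data)
    return policy
```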

Reward Model


Key findings:

  • After filtering out incorrect responses, adding an extra process reward model to re-rank and select the generated responses brings substantial gains, even if the reward model itself is not a qualified verifier.
  • Our MPRM performs better as a Reranker than as a Verifier (with answer filtering); see the sketch below.
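A minimal sketch of this "answer filtering + PRM reranking" selection step; the prm.score interface and the "Answer:"-style answer extraction are assumptions for illustration.

```python
# Sketch of answer filtering followed by PRM reranking (assumed interfaces).

def select_responses(prm, question, image, candidates, answer, top_k: int = 2):
    # Answer filtering: drop responses whose extracted final answer is wrong.
    correct = [c for c in candidates
               if c.rsplit("Answer:", 1)[-1].strip() == answer]
    # Reranking: order the survivors by the process reward model's score
    # and keep the top-k for training.
    ranked = sorted(correct, key=lambda c: prm.score(question, image, c), reverse=True)
    return ranked[:top_k]
```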

Prompt Variation


Key findings:

  • Adding more unlabeled queries helps only when the reward signals are perfect (e.g., oracle ground-truth answers); it hurts performance if the reward model does not generalize well to unseen data.

Dynamics of Self-Evolution

Introduction

We delve even deeper into the current self-evolution strategy to better understand the bottlenecks. Instead of analyzing from a design space perspective as previously, we now fix the design parameters and focus exclusively on the training dynamics during the model's self-evolution. This shift in focus allows us to examine the process from an orthogonal angle, providing further insights into the underlying mechanisms that drive or impede progress in multimodal reasoning capabilities.

Monitoring the Training Dynamics

We analyze several key metrics to understand how the model changes during the evolution process.

  • Greedy Accuracy: the model's accuracy with greedy decoding. We track this metric for reference to compare with other metrics.
  • Pass@K: the percentage of samples for which the model produces at least one correct response when sampling K candidates. This metric measures the model's exploration ability.
  • Pass@K - Greedy Accuracy: the difference between Pass@K and Greedy Accuracy. Typically, Pass@K is an upper bound of Greedy Accuracy, and the gap roughly reflects the percentage of samples where the model, while failing in greedy decoding, can generate a correct response when sampling more candidates. This gap is crucial for the success of self-evolving training: a zero gap indicates that the model fails to explore correct responses for the current failure cases, suggesting that further training is unlikely to yield significant improvement.
  • Reward-Pass@2: the percentage of samples for which at least one correct response appears among the top 2 responses ranked by the reward model. This metric directly reflects the exploitation efficacy of the reward model for the current policy. We choose Pass@2 since our training strategy involves selecting the top 2 responses using the reward model.
These metrics go beyond Greedy Accuracy as deeper measures of exploration and exploitation, providing insight into the saturation of exploration and helping us assess the model's progression toward optimal performance.
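A sketch of how these metrics could be computed on a validation set follows. It assumes each record stores the greedy result plus K sampled responses annotated with correctness and a PRM score; the actual evaluation code may differ.

```python
# Sketch of the monitoring metrics (assumed record layout, not the exact eval code).

def monitor(records, k: int = 16):
    n = len(records)
    greedy_acc = sum(r["greedy_correct"] for r in records) / n
    # Pass@K: at least one of the K sampled responses is correct.
    pass_k = sum(any(s["correct"] for s in r["samples"][:k]) for r in records) / n
    # Reward-Pass@2: a correct response appears in the top-2 by PRM score.
    def top2_hit(r):
        top2 = sorted(r["samples"][:k], key=lambda s: s["prm_score"], reverse=True)[:2]
        return any(s["correct"] for s in top2)
    reward_pass_2 = sum(top2_hit(r) for r in records) / n
    return {
        "greedy": greedy_acc,
        "pass@k": pass_k,
        "pass@k_minus_greedy": pass_k - greedy_acc,
        "reward_pass@2": reward_pass_2,
    }
```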

Progress or Regress
[Figures: Pass@K, Pass@K - Greedy gap, and Reward-Pass@2 over the course of self-evolving training]

Exploration saturates over the course of self-evolution, especially when the sampling temperature is low. How can we enhance exploration so that the reward model can exploit more effectively?

Adaptive Explorations

Final Recipe: Dynamic Strategy
  • Monitor exploration & exploitation during training via a validation set.
  • Reward-Pass@2: a bridge between exploration (generation) and exploitation (re-ranking) ✅
  • Pass@K: only considers exploration ❌
  • Performance can be further improved (see the sketch below).
  • The saturation of self-evolution is alleviated.
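One way to operationalize this dynamic strategy is sketched below: periodically re-select the sampling temperature that maximizes Reward-Pass@2 on the validation set. The sample_and_score helper is assumed, the rule reuses the monitor() sketch above, and the exact adaptive mechanism used in M-STAR may differ.

```python
# Hypothetical adaptive-exploration rule: pick the sampling temperature that
# maximizes Reward-Pass@2 on the validation set (assumed helpers; the exact
# M-STAR strategy may differ).

def pick_temperature(policy, prm, val_set, candidate_temps=(0.6, 0.8, 1.0, 1.2), k=16):
    best_temp, best_rp2 = candidate_temps[0], -1.0
    for temp in candidate_temps:
        # sample_and_score is assumed to sample k responses per validation query
        # at this temperature and annotate them with correctness and PRM scores.
        records = sample_and_score(policy, prm, val_set, temperature=temp, k=k)
        rp2 = monitor(records, k=k)["reward_pass@2"]  # monitor() from the sketch above
        if rp2 > best_rp2:
            best_temp, best_rp2 = temp, rp2
    return best_temp
```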

BibTeX

@misc{liu2024diving,
      title={Diving into Self-Evolving Training for Multimodal Reasoning},
      author={Wei Liu and Junlong Li and Xiwen Zhang and Fan Zhou and Yu Cheng and Junxian He},
      year={2024},
      eprint={...},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}