
M-STAR

Diving into Self-Evolving Training for Multimodal Reasoning

1 The Hong Kong University of Science and Technology, 2 Shanghai Jiao Tong University,
3 Helixon Research, 4 The Chinese University of Hong Kong
* Equal contributions

Introduction

M-STAR (short for Multimodal Self-evolving TrAining for Reasoning) is a project aimed at improving multimodal reasoning via self-evolving training.

In M-STAR, we aim to answer the following questions:

  • Can we enhance Multimodal Reasoning through Self-Evolving Training?
  • How can we comprehensively understand each factor of self-evolving training and design an optimal recipe for multimodal reasoning?
  • What insights can we gain from the training Dynamics of Self-Evolution, and how can these insights inform better self-evolving training for multimodal reasoning?
We also release the corresponding code and resources to facilitate future research:
  • M-STAR Framework: A framework for Self-Evolving Training of Large Multimodal Models (LMMs) including Generation, Training, and Rewarding.
  • M-STAR Resources: M-STAR Models, CoT Dataset, and Multimodal Process Reward Model (MPRM) Training Dataset.

Overview

[Main figure: M-STAR overview]

Exploration & Exploitation

Self-Evolving Training for Multimodal Reasoning can be viewed as a loop of exploration and exploitation: the model generates responses to various question-image pairs (exploration) and learns to maximize the expected reward over those responses (exploitation). Through this self-evolution cycle, the model continually refines its performance. This approach improves the model's reasoning abilities without relying on extensive human-annotated data, making it more efficient and scalable for multimodal reasoning.
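As a concrete illustration, here is a minimal sketch of this exploration-exploitation loop. The policy interface (sample/train) and the binary answer-matching reward are assumptions for illustration, not the actual M-STAR implementation.

```python
# Minimal sketch of the exploration-exploitation loop (assumed interfaces,
# not the released M-STAR code).

def answer_reward(response: str, answer: str) -> float:
    """Binary reward: 1.0 if the response ends with the ground-truth answer."""
    return 1.0 if response.strip().endswith(answer.strip()) else 0.0

def self_evolve(policy, dataset, num_iterations: int, num_samples: int = 8):
    for _ in range(num_iterations):
        training_pool = []
        for question, image, answer in dataset:
            # Exploration: sample candidate chain-of-thought responses.
            candidates = policy.sample(question, image, n=num_samples)
            # Exploitation: keep only responses that earn reward.
            kept = [c for c in candidates if answer_reward(c, answer) > 0]
            training_pool.extend((question, image, c) for c in kept)
        # Update the policy on its own filtered generations.
        policy.train(training_pool)
    return policy
```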

Self-Evolving Design Components

To comprehensively understand this loop, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning.

Self-Evolution Dynamics

Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. Based on these investigations, we present a final recipe that accounts for both exploration and exploitation in self-evolving training for multimodal reasoning, encapsulating these design choices in the M-STAR framework. M-STAR achieves 59.5% accuracy on MathVista, surpassing the pre-evolved model by 6.9% absolute without using additional human annotations.

Diving into Self-Evolving Design Components

Warm-Up Phase


In the realm of multimodal training, Chain-of-Thought (CoT) data is notably scarce, which poses challenges in the development of models capable of generating intermediate reasoning steps. The Warm-Up Phase in our project represents the initial step before self-evolving training, designed to establish a foundational policy model. During this phase, the model is prompted to generate reasoning steps for each input triplet consisting of a question, an image, and an answer. By filtering the responses based on answer accuracy, we conduct warmup training for the policy model, enabling it to begin generating coherent CoT responses. This preparatory phase is crucial in equipping the policy model with the capability to produce intermediate reasoning steps, forming a stepping stone toward more advanced, self-evolving multimodal training.
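A rough sketch of this warm-up data construction is shown below; the sample_cot interface and the "Answer:"-style final-answer format are illustrative assumptions, not the released code.

```python
# Sketch of warm-up data construction (hypothetical helpers, not the released code).

def extract_final_answer(cot: str) -> str:
    """Naive final-answer extraction: the text after the last 'Answer:' marker."""
    return cot.rsplit("Answer:", 1)[-1].strip()

def build_warmup_set(model, triplets, n_samples: int = 4):
    """Keep CoT rationales whose extracted final answer matches the ground truth."""
    warmup_data = []
    for question, image, answer in triplets:
        for cot in model.sample_cot(question, image, n=n_samples):
            if extract_final_answer(cot) == answer:
                warmup_data.append({"question": question, "image": image, "response": cot})
                break  # one correct rationale per triplet is enough for this sketch
    return warmup_data
```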

Training Method


Key findings:

  • Optimizing the model from the last checkpoint is superior to retraining from scratch every time.
  • Inheriting the optimizer states from the previous iteration also leads to performance improvements.
  • Each iteration should traverse an appropriately sized interval of queries from the training set, neither too large nor too small.
  • Together, these choices lead to a more online form of self-evolving training, as sketched below.
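The sketch below illustrates what such a more online schedule might look like. The explore_fn/train_fn callables stand in for the generation and training stages and are assumptions for illustration; the interval size and the persistence of optimizer states are the knobs discussed above.

```python
# Sketch of a more online evolution schedule (assumed interfaces).
# The policy and optimizer persist across iterations, and each iteration
# traverses only a moderate slice (interval) of the training queries.

def online_self_evolve(policy, optimizer, queries, explore_fn, train_fn,
                       num_iterations: int, interval: int):
    for it in range(num_iterations):
        # Traverse the next interval of queries, wrapping around the training set.
        start = (it * interval) % len(queries)
        chunk = queries[start:start + interval]
        data = explore_fn(policy, chunk)  # sample and filter responses
        # No re-initialization: weights and optimizer states carry over from the
        # previous iteration instead of retraining from scratch.
        train_fn(policy, optimizer, data)
    return policy
```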

Reward Model


Key findings:

  • After filtering out incorrect responses, adding an extra process reward model to re-rank and select the generated responses brings substantial gains, even if the reward model itself is not a qualified verifier.
  • Our MPRM performs better as a Reranker than as a Verifier (with answer filtering); see the sketch below.
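A minimal sketch of this "answer filtering + PRM reranking" selection step; the prm.score interface and the "Answer:"-style answer extraction are assumptions for illustration.

```python
# Sketch of answer filtering followed by PRM reranking (assumed interfaces).

def select_responses(prm, question, image, candidates, answer, top_k: int = 2):
    # Answer filtering: drop responses whose extracted final answer is wrong.
    correct = [c for c in candidates
               if c.rsplit("Answer:", 1)[-1].strip() == answer]
    # Reranking: order the survivors by the process reward model's score
    # and keep the top-k for training.
    ranked = sorted(correct, key=lambda c: prm.score(question, image, c), reverse=True)
    return ranked[:top_k]
```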

Prompt Variation


Key findings:

  • Adding more unlabeled queries helps only when the reward signals are perfect (e.g., oracle ground-truth answers); it hurts performance if the reward model does not generalize well to unseen data.

Dynamics of Self-Evolution

Introduction

We delve even deeper into the current self-evolution strategy to better understand the bottlenecks. Instead of analyzing from a design space perspective as previously, we now fix the design parameters and focus exclusively on the training dynamics during the model's self-evolution. This shift in focus allows us to examine the process from an orthogonal angle, providing further insights into the underlying mechanisms that drive or impede progress in multimodal reasoning capabilities.

Monitoring the Training Dynamics

We analyze several key metrics to understand how the model changes during the evolution process.

  • Greedy Accuracy: the model's accuracy with greedy decoding. We track this metric for reference to compare with other metrics.
  • Pass@K: the percentage of samples for which the model produces at least one correct response when sampling K candidates. This metric measures the model's exploration ability.
  • Pass@K - Greedy Accuracy: the difference between Pass@K and Greedy Accuracy. Typically, Pass@K is an upper bound of Greedy Accuracy, and the gap roughly reflects the percentage of samples where the model, while failing in greedy decoding, can generate a correct response when sampling more candidates. This gap is crucial for the success of self-evolving training: a zero gap indicates that the model fails to explore correct responses for the current failure cases, suggesting that further training is unlikely to yield significant improvement.
  • Reward-Pass@2: the percentage of samples for which at least one correct response appears among the top 2 responses ranked by the reward model. This metric directly reflects the exploitation efficacy of the reward model for the current policy. We choose Pass@2 since our training strategy involves selecting the top 2 responses using the reward model.
These metrics go beyond Greedy Accuracy as deeper measures of exploration and exploitation, providing insight into the saturation of exploration and helping us assess the model's progression toward optimal performance.
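A sketch of how these metrics could be computed on a validation set follows. It assumes each record stores the greedy result plus K sampled responses annotated with correctness and a PRM score; the actual evaluation code may differ.

```python
# Sketch of the monitoring metrics (assumed record layout, not the exact eval code).

def monitor(records, k: int = 16):
    n = len(records)
    greedy_acc = sum(r["greedy_correct"] for r in records) / n
    # Pass@K: at least one of the K sampled responses is correct.
    pass_k = sum(any(s["correct"] for s in r["samples"][:k]) for r in records) / n
    # Reward-Pass@2: a correct response appears in the top-2 by PRM score.
    def top2_hit(r):
        top2 = sorted(r["samples"][:k], key=lambda s: s["prm_score"], reverse=True)[:2]
        return any(s["correct"] for s in top2)
    reward_pass_2 = sum(top2_hit(r) for r in records) / n
    return {
        "greedy": greedy_acc,
        "pass@k": pass_k,
        "pass@k_minus_greedy": pass_k - greedy_acc,
        "reward_pass@2": reward_pass_2,
    }
```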

Progress or Regress
[Figures: Pass@K, Pass@K - Greedy gap, and Reward-Pass@2 over the course of self-evolving training]

Exploration saturates over the course of self-evolution, especially when the sampling temperature is low. How can we enhance exploration so that the reward model can exploit more effectively?

Adaptive Explorations

Final Recipe: Dynamic Strategy
  • Monitor exploration & exploitation during training via a validation set.
  • Reward-Pass@2: a bridge between exploration (generation) and exploitation (re-ranking) ✅
  • Pass@K: only considers exploration ❌
  • Performance can be further improved (see the sketch below).
  • The saturation of self-evolution is alleviated.
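One way to operationalize this dynamic strategy is sketched below: periodically re-select the sampling temperature that maximizes Reward-Pass@2 on the validation set. The sample_and_score helper is assumed, the rule reuses the monitor() sketch above, and the exact adaptive mechanism used in M-STAR may differ.

```python
# Hypothetical adaptive-exploration rule: pick the sampling temperature that
# maximizes Reward-Pass@2 on the validation set (assumed helpers; the exact
# M-STAR strategy may differ).

def pick_temperature(policy, prm, val_set, candidate_temps=(0.6, 0.8, 1.0, 1.2), k=16):
    best_temp, best_rp2 = candidate_temps[0], -1.0
    for temp in candidate_temps:
        # sample_and_score is assumed to sample k responses per validation query
        # at this temperature and annotate them with correctness and PRM scores.
        records = sample_and_score(policy, prm, val_set, temperature=temp, k=k)
        rp2 = monitor(records, k=k)["reward_pass@2"]  # monitor() from the sketch above
        if rp2 > best_rp2:
            best_temp, best_rp2 = temp, rp2
    return best_temp
```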

BibTeX

@misc{liu2024diving,
      title={Diving into Self-Evolving Training for Multimodal Reasoning},
      author={Wei Liu and Junlong Li and Xiwen Zhang and Fan Zhou and Yu Cheng and Junxian He},
      year={2024},
      eprint={...},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}