Multimodal Preference Data Synthetic Alignment with Reward Model
Paper: arXiv:2412.17417
PDS-DPO-7B is a vision-language model built upon LLaVA 1.5 7B and trained with the proposed Preference Data Synthetic Direct Preference Optimization (PDS-DPO) framework. The framework constructs synthetic preference data: candidate responses are produced by generative models and scored by a reward model that serves as a proxy for human preferences. Training on these preference pairs improves alignment, reduces hallucinations, and enhances reasoning capabilities.
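A minimal sketch of the general idea, assuming a reward model is used to rank candidate responses into (chosen, rejected) pairs and the model is then optimized with the standard DPO objective. The function names and data handling here are hypothetical stand-ins, not the authors' released code:

```python
# Sketch: build preference pairs from reward-model scores, then apply the
# standard DPO loss. Policy/reference log-probs are assumed to be computed
# elsewhere (e.g., per-token log-probs summed over the response).
import torch
import torch.nn.functional as F

def build_preference_pair(candidates, reward_scores):
    """Pick the highest- and lowest-scoring responses as (chosen, rejected)."""
    best = max(range(len(candidates)), key=lambda i: reward_scores[i])
    worst = min(range(len(candidates)), key=lambda i: reward_scores[i])
    return candidates[best], candidates[worst]

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```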
@article{wijaya2024multimodal,
  title={Multimodal Preference Data Synthetic Alignment with Reward Model},
  author={Wijaya, Robert and Nguyen, Ngoc-Bao and Cheung, Ngai-Man},
  journal={arXiv preprint arXiv:2412.17417},
  year={2024}
}