TL;DR: Auto Cherry-Picker is a framework that synthesizes training samples for both perception and multi-modal reasoning tasks from a simple object list in natural language. It employs a newly designed metric, CLIS, to ensure the quality of the synthetic data.
Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, fully automatic layout generation driven only by language, together with a suitable metric for measuring multiple generated instances, has not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment both perception and multi-modal training. Starting from a simple list of natural language concepts, we first prompt Large Language Models (LLMs) to generate a detailed description and design reasonable layouts, and then use an off-the-shelf text-to-image model to generate multiple images. The generated data are finally refined with a comprehensively designed metric to ensure quality. By customizing the initial concept list, our synthetic high-quality examples boost performance in various scenarios, especially in addressing challenges associated with long-tailed distributions and imbalanced datasets. Experimental results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we thoroughly investigate the correlation between our metric and the performance gains on downstream tasks. Code will be available.
Illustration of the Auto Cherry-Picker (ACP) pipeline. Our ACP pipeline contains a raw data generator (a) and a data filter (b) using CLIS. Conditioned on an input object list, the Scene Graph Generator produces detailed attributes, relations, a caption, and corresponding layouts. The Image Generator then produces a set of images based on the scene graph. These raw layouts and images are refined by filters using CLIS-L and CLIS-I, respectively, to produce high-quality training data.
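The pipeline above amounts to a generate-then-filter loop. Below is a minimal Python sketch of that flow, assuming threshold-based filtering; every function name and threshold here is an illustrative stub for exposition, not the authors' released API.

```python
# Hypothetical sketch of the ACP generate-then-filter flow.
# All functions below are illustrative stubs, not the paper's actual code.

def generate_scene_graph(object_list):
    """Stub: prompt an LLM to expand the object list into a scene graph
    with attributes, relations, a caption, and candidate layouts."""
    return {"caption": ", ".join(object_list),
            "layouts": [[(0.1, 0.1, 0.5, 0.5)]]}  # boxes per layout

def generate_images(caption, layout, n):
    """Stub: render n candidate images with a text-to-image model."""
    return [f"image_{i}.png" for i in range(n)]

def clis_l(layout):
    """Stub: layout-level quality score CLIS-L (higher is better)."""
    return 1.0

def clis_i(image, scene_graph):
    """Stub: image-level quality score CLIS-I (higher is better)."""
    return 1.0

def auto_cherry_pick(object_list, n_images=4,
                     layout_thresh=0.5, image_thresh=0.5):
    # (a) Raw data generator: object list -> scene graph -> images.
    scene_graph = generate_scene_graph(object_list)
    # (b) Data filter, stage 1: keep layouts passing the CLIS-L filter.
    layouts = [l for l in scene_graph["layouts"]
               if clis_l(l) >= layout_thresh]
    samples = []
    for layout in layouts:
        # (b) Data filter, stage 2: keep images passing the CLIS-I filter.
        for image in generate_images(scene_graph["caption"], layout, n_images):
            score = clis_i(image, scene_graph)
            if score >= image_thresh:
                samples.append({"image": image, "layout": layout,
                                "clis": score})
    return samples
```

For example, `auto_cherry_pick(["cat", "sofa"])` would yield a list of image/layout pairs annotated with their CLIS scores, ready to serve as synthetic training data.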
Each pair of synthetic samples is generated from the same input object list, with our CLIS metric favoring the right sample in each pair.
Visualization of training samples for different downstream tasks. Given the same input object list or scene graph on the left, the CLIS score increases from left to right, with the final annotations shown on the right.
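The pairwise comparisons in these figures amount to ranking candidates by CLIS and keeping the best. A minimal sketch, assuming each candidate already carries a precomputed `clis` field as in the pipeline sketch above (the field name and data layout are hypothetical):

```python
# Hypothetical cherry-picking step: among candidates generated from the same
# object list, keep the top-k by CLIS score. `candidates` is assumed to be a
# list of dicts with a precomputed "clis" field.

def cherry_pick(candidates, k=1):
    return sorted(candidates, key=lambda s: s["clis"], reverse=True)[:k]

# Example: for each pair in the figure, the higher-scoring (right) sample wins.
pair = [{"image": "left.png", "clis": 0.41},
        {"image": "right.png", "clis": 0.78}]
best = cherry_pick(pair, k=1)[0]  # -> {"image": "right.png", "clis": 0.78}
```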
```bibtex
@article{chen2024autocherrypicker,
  title={Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language},
  author={Chen, Yicheng and Li, Xiangtai and Li, Yining and Zeng, Yanhong and Wu, Jianzong and Zhao, Xiangyu and Chen, Kai},
  journal={arXiv preprint arXiv:2406.20085},
  year={2024}
}
```