TL;DR: MIG is an automatic data selection method for instruction tuning. We introduce an information-based dataset measurement that comprehensively quantifies data quality and diversity in semantic space, and we propose an efficient greedy algorithm that iteratively selects the data point maximizing information gain.
Data quality and diversity are pivotal in constructing effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from this vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, the absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally operate on distances or clusters in the embedding space, which fails to accurately capture the intent of complex instructions in semantic space. To bridge this gap, we propose a unified method for measuring dataset information. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on this measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% of the Tulu3 data sampled by MIG achieves performance comparable to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on WildBench. This finding shows the potential of unified dataset measurement for guiding instruction data selection.
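To make the measurement concrete, below is a minimal Python sketch of an information score computed over a label graph. It is illustrative only: the per-sample (label_indices, quality_score) format, the propagation via a label-similarity matrix, and the concave function phi(x) = x**alpha are assumptions standing in for the paper's exact formulation.

```python
# A minimal sketch of a label-graph information measurement, under the
# assumptions stated above; not the paper's exact formulation.
import numpy as np

def dataset_information(samples, num_labels, adjacency, alpha=0.5):
    """Information of a selected subset, spread over the label graph.

    samples   : list of (label_indices, quality_score) pairs
    adjacency : (num_labels, num_labels) label-similarity matrix
                (self-loops included so each label keeps its own mass)
    alpha     : exponent of phi(x) = x**alpha; concavity yields diminishing
                returns for piling mass on one label, rewarding diversity
    """
    mass = np.zeros(num_labels)
    for labels, quality in samples:
        for lab in labels:
            # Split each sample's quality score across its labels.
            mass[lab] += quality / max(len(labels), 1)
    mass = adjacency @ mass          # propagate along label-similarity edges
    return float(np.sum(mass ** alpha))
```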
Illustration of (a) Data Selection Pipeline and (b) MIG Sampler. Given the raw data pool, our pipeline first applies a tagger and scorer to annotate data. Next, MIG constructs the label graph based on the label set and iteratively selects the data point that maximizes the information gain within the graph. The selected data are used for supervised fine-tuning (SFT) of LLMs.
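A correspondingly minimal greedy sampler is sketched below, reusing `dataset_information` from the previous snippet. The naive full recomputation of every candidate's gain is a stand-in for the paper's efficient incremental update, and the budget and pool format are again illustrative assumptions.

```python
# A minimal greedy MIG-style sampler: at each step, add the sample whose
# inclusion maximizes the information gain. Reuses dataset_information
# from the sketch above; the naive gain recomputation is for clarity only.
import numpy as np

def mig_select(pool, budget, num_labels, adjacency):
    """Iteratively pick the sample that maximizes information gain."""
    selected, remaining = [], list(pool)
    current = dataset_information(selected, num_labels, adjacency)
    for _ in range(budget):
        if not remaining:
            break
        gains = [
            dataset_information(selected + [s], num_labels, adjacency) - current
            for s in remaining
        ]
        best = int(np.argmax(gains))
        selected.append(remaining.pop(best))
        current += gains[best]
    return selected
```

In this sketch, a call such as `mig_select(pool, 20000, num_labels, adjacency)` would correspond to the 20K-sample budget discussed below.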
MIG outperforms random selection and the previous state-of-the-art method on knowledge-based (six benchmarks) and human-preference (three benchmarks) evaluations.
Comparison with data selection methods on the Tulu3 pool across different base models. HE denotes HumanEval, AE denotes AlpacaEval v2, MT denotes MTBench, and Wild denotes WildBench. Avg_obj denotes the average score on the knowledge-based (objective) benchmarks.
MIG consistently delivers superior performance at each data budget, demonstrating its robust scalability. Remarkably, MIG achieves comparable performance to the full Tulu3 pool with only 20K samples.
@article{chen2025mig,
  title={MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space},
  author={Chen, Yicheng and Li, Yining and Hu, Kai and Ma, Zerun and Ye, Haochen and Chen, Kai},
  journal={arXiv preprint arXiv:2504.13835},
  year={2025}
}