Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

1. Introduction

•

Big Picture of LLMs

◦

Pre-training: endows the model with strong capabilities to generate natural language responses

◦

Instruction-Tuning: aligns the model with human preferences → tunes the model to draw on its knowledge in response to instructions

•

Instruction-Response Dataset Construction

◦

Self-Instruct: prompts the model to generate its own instruction-following data

◦

Another approach employs LLMs to produce data for supervised fine-tuning (SFT) of smaller LLMs (sLLMs)

•

Recently, LIMA

◦

Hypothesis:

▪

An LLM learns most of its knowledge during pre-training; instruction-following data only serves to make it follow instructions.

◦

Instruction tuning is possible with only a small amount of high-quality (human-annotated) data.

→ This means that instruction-following data above a minimum quality bar has to be selected, which is costly.

→ So how can high-quality instruction-following data be selected efficiently? This paper proposes an answer.

2. Methodology

2.1 What is Instruction Quality?

•

As in LIMA, this paper judges instruction-following data by how well it gets the model to follow instructions.

•

In other words, an LLM tuned on high-quality instruction-following data should respond to any instruction in line with its intent.

The authors put forward the following hypothesis.

→ D_{eval} is a high-quality, unbiased evaluation set.

→ A model M is fine-tuned (under training setting S) on instruction-following data D to obtain M~; the 'inference loss' of M~ on D_{eval} is then taken as the quality of D.

→ Given several instruction-following datasets D, Equation (1) is obtained by computing the loss for each one and then taking the ratio.
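To make the quality definition concrete, here is a minimal sketch of an inference loss computed as the mean negative log-likelihood per reference token. The per-token averaging and the hard-coded log-probabilities are assumptions for illustration; the notes do not state how the paper aggregates the loss, and in practice the log-probabilities would come from the fine-tuned model M~ scoring the reference responses in D_{eval}.

```python
import math


def inference_loss(token_log_probs):
    """Mean negative log-likelihood over all reference tokens.

    token_log_probs holds one inner list per reference response; each value
    is the natural-log probability the fine-tuned model assigns to that token.
    """
    total = sum(-lp for response in token_log_probs for lp in response)
    count = sum(len(response) for response in token_log_probs)
    if count == 0:
        raise ValueError("no tokens to score")
    return total / count


# Hypothetical log-probabilities from two models fine-tuned on datasets A and B.
loss_a = inference_loss([[math.log(0.5), math.log(0.5)], [math.log(0.25)]])
loss_b = inference_loss([[math.log(0.8), math.log(0.8)], [math.log(0.5)]])

# Under the hypothesis above, the dataset with the lower loss is higher quality.
better = "B" if loss_b < loss_a else "A"
print(f"loss_a={loss_a:.4f} loss_b={loss_b:.4f} better={better}")
```
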

2.2 Quality Evaluation

•

Section 2.1 used the inference loss to define instruction quality; the paper now proposes the following method to approximate that quality.

•

The authors aim to approximate the inference loss using the natural language indicators below.

→ The instruction-following data D is passed through every indicator in the set I = {I_{1}, ..., I_{N}}, and a score is computed per indicator.

→ The paper argues that L(M~, D_{eval}) from 2.1 can be approximated by some function F as F(I(D)).

→ F is a multiple linear regression.

→ In effect, this is a search for the features of instruction-following data that are most linearly related to the instruction-following loss.
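Written out, the approximation takes the generic form of a multiple linear regression. The block below is a sketch in that standard form, with \tilde{M} standing for M~; the coefficient symbols \beta_i and the noise term \epsilon are notation assumed here, and the paper's own equation may differ in detail (for example, by regressing a transformed loss).

```latex
L(\tilde{M}, D_{eval}) \approx F(I(D))
  = \beta_0 + \sum_{i=1}^{N} \beta_i \, I_i(D) + \epsilon
```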

2.3 Empirical Study Design

Select several candidate datasets.

Fuse and sample from them to form datasets of 'different quality levels'.

For each dataset, finetune a language model on it and evaluate the model on a shared evaluation set. 

For each dataset, calculate a bag of indicator values on the dataset. 

Perform a linear regression analysis based on our curated experiment results to estimate the linear rule parameters.
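The last step can be sketched with an ordinary least-squares fit. Everything in this example is synthetic: the indicator values, the `true_beta` coefficients, and the losses are generated at random purely to show the mechanics, and `predict_loss` is a hypothetical helper name. In the real study each row would hold the indicator values of one fused dataset and the measured loss of the model fine-tuned on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic experiment table: one row per fine-tuning dataset,
# one column per indicator (all values are made up for illustration).
n_datasets, n_indicators = 78, 3
indicators = rng.normal(size=(n_datasets, n_indicators))
true_beta = np.array([-0.05, 0.02, -0.03])
losses = 1.0 + indicators @ true_beta + rng.normal(scale=0.001, size=n_datasets)

# Add a column of ones so the fit also estimates the constant term.
design = np.column_stack([np.ones(n_datasets), indicators])
coef, _, _, _ = np.linalg.lstsq(design, losses, rcond=None)
intercept, beta = coef[0], coef[1:]


def predict_loss(indicator_values):
    """Apply the fitted linear rule to the indicator values of a new dataset."""
    return intercept + np.asarray(indicator_values) @ beta


print("intercept:", round(float(intercept), 4))
print("beta:", np.round(beta, 4))
```

Once fitted, the rule estimates the loss of a new dataset from its indicator values alone, without another fine-tuning run.
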

#### Multivariate Evaluation

•

Uses the setting above.

#### Univariate Evaluation

•

study the individual correlation between each indicator and instruction data quality (loss)

•

Sort the instruction-following data D in descending order of a single indicator score, split it into K bins, and compute the inference loss for each bin.

•

correlation score: the correlation between indicator score and inference loss
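The univariate procedure can be sketched as follows. The random scores and `stand_in_loss` are placeholders: in the study each bin would be used to fine-tune a model whose loss is then measured on D_{eval}, whereas here the loss is a made-up linear function of the bin's mean score so that the example runs on its own. Pearson correlation is also an assumption; the notes do not say which correlation measure the paper uses.

```python
import random


def split_into_bins(examples, score_of, k):
    """Sort by indicator score (descending) and cut into k near-equal bins."""
    ranked = sorted(examples, key=score_of, reverse=True)
    size, extra = divmod(len(ranked), k)
    bins, start = [], 0
    for b in range(k):
        end = start + size + (1 if b < extra else 0)
        bins.append(ranked[start:end])
        start = end
    return bins


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_score(bin_):
    return sum(bin_) / len(bin_)


def stand_in_loss(bin_):
    # Placeholder for "fine-tune on this bin, then evaluate on D_eval".
    return 1.0 - 0.1 * mean_score(bin_)


rng = random.Random(0)
scores = [rng.random() for _ in range(2000)]  # hypothetical indicator scores
bins = split_into_bins(scores, score_of=lambda s: s, k=8)
bin_scores = [mean_score(b) for b in bins]
bin_losses = [stand_in_loss(b) for b in bins]
correlation = pearson(bin_scores, bin_losses)
print(f"correlation between indicator score and loss: {correlation:.3f}")
```
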

3. Empirical Settings

Datasets

#### Training

(ALPACA → 2.0K)

→ Draw a random number r_{i} for each dataset, randomly select 2000 * r_{i} / Σ_{j} r_{j} samples from each dataset, and combine them.

→ Size of each instruction-following dataset D: 2000 examples

→ K=8 for Univariate Evaluation
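The mixing scheme above can be sketched like this. The dataset names, pool sizes, the uniform distribution for r_{i}, and the helper name `mix_datasets` are all assumptions for illustration; the notes do not specify how r_{i} is drawn. Because each share is rounded, the combined size can differ from 2000 by a few examples.

```python
import random


def mix_datasets(datasets, total=2000, seed=0):
    """Draw a random weight per dataset and sample a proportional share."""
    rng = random.Random(seed)
    weights = {name: rng.random() for name in datasets}  # r_i, assumed uniform
    weight_sum = sum(weights.values())
    mixed = []
    for name, examples in datasets.items():
        share = round(total * weights[name] / weight_sum)
        share = min(share, len(examples))  # cannot sample more than exists
        mixed.extend(rng.sample(examples, share))
    return mixed, weights


# Hypothetical candidate pools, each large enough to supply any share.
pools = {
    name: [f"{name}-{i}" for i in range(5000)]
    for name in ("dataset_a", "dataset_b", "dataset_c", "dataset_d")
}
mixed, weights = mix_datasets(pools)
print(len(mixed), {k: round(v, 3) for k, v in weights.items()})
```

Changing `seed` yields a different mixture, which is how many datasets of varying composition can be produced from the same candidate pools.
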

#### D_{eval}

•

252 instructions from Wang et al. (2022a) and 80 from Zheng et al. (2023)

•

we employed gpt-3.5-turbo from OPENAI to generate five unique outputs for each instruction.

Finetuning Settings

•

LLAMA-7B. 

•

8bit QLORA

•

finetuning for 3 epochs, with per step batch size set to 8

4. Empirical Results

Multivariate Analysis

•

A total of 78 instruction-following datasets D were generated.

•

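The analysis was carried out with stepwise regression.

→ The statistically significant regression coefficients were those for Reward, Length, KNN, and the constant term.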

•

Reward: the higher the overall reward score of the responses to the instructions, the lower the inference loss tends to be.

•

KNN-6: distance to the 6th nearest neighbor in the Sentence-BERT embedding space (a larger value means the data are more spread out)

→ The more diverse the instruction-response pairs, the higher the quality of the instruction-following data.
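As an illustration of the KNN-6 indicator, the sketch below computes each point's distance to its k-th nearest neighbor by brute force. Random vectors stand in for Sentence-BERT embeddings, and the Euclidean metric and exact search are assumptions; a real pipeline would embed the text first and might use an approximate neighbor search.

```python
import numpy as np


def knn_distance(embeddings, k=6):
    """Distance from each row to its k-th nearest other row (Euclidean)."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting a row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]


rng = np.random.default_rng(0)
tight = rng.normal(scale=0.1, size=(200, 16))   # stand-in: low-diversity data
spread = rng.normal(scale=1.0, size=(200, 16))  # stand-in: high-diversity data

print("tight mean KNN-6: ", round(float(knn_distance(tight).mean()), 3))
print("spread mean KNN-6:", round(float(knn_distance(spread).mean()), 3))
```

The more dispersed set produces the larger average distance, which is the sense in which a larger KNN-6 value indicates more diverse data.
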

Univariate Analysis

→ Series cyan represents data collected from multivariate analysis (randomly sampled)

→ Series yellow represents data collected from univariate analysis (hierarchically sampled)

•

PPL, MTLD (perhaps because higher diversity in a dataset of limited size makes it harder to predict?), Nat, and Und exhibit positive correlations with the anticipated evaluation loss

•

Rew and Coh showcase negative correlations with the evaluation loss.

Quality-Guided Instruction Selection

•

The rule is validated by subsampling from an unseen dataset, 'databricks-dolly-15k', based on Equation (4).

→ E: train LLAMA on the subsampled dataset, then evaluate on D_{eval}

→ Rule: the value of Equation (4)

→ As the rule value gets smaller, the inference loss on D_{eval} also drops for unseen data (more effective than random subsampling).
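Rule-guided selection can be sketched as scoring each example with a fitted linear rule and keeping the lowest-scoring ones. This assumes the rule is evaluated per example, which the notes do not confirm. The coefficients, intercept, indicator names, and example values are placeholders, not the values of Equation (4), and `select_by_rule` is a hypothetical helper name.

```python
def select_by_rule(examples, coefficients, intercept, n_select):
    """Keep the n_select examples with the lowest predicted loss."""
    def rule(example):
        return intercept + sum(
            weight * example[name] for name, weight in coefficients.items()
        )
    return sorted(examples, key=rule)[:n_select]


# Placeholder coefficients and precomputed indicator values (illustrative only).
coefficients = {"reward": -0.1, "length": -0.001}
examples = [
    {"id": 0, "reward": 0.2, "length": 50},
    {"id": 1, "reward": 0.9, "length": 120},
    {"id": 2, "reward": -0.5, "length": 30},
    {"id": 3, "reward": 0.6, "length": 80},
]

selected = select_by_rule(examples, coefficients, intercept=1.0, n_select=2)
print([example["id"] for example in selected])
```

A random baseline for comparison would draw the same number of examples uniformly, for instance with `random.sample(examples, n_select)`.
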

•

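LLAMA models tuned on the rule-selected and the randomly selected datasets were compared qualitatively, with gpt-3.5-turbo and gpt-4 judging their generated responses on D_{eval}.

→ Since gpt-3.5-turbo was used to create the evaluation-set responses, could that be why the results look good under gpt-3.5-turbo?

→ The paper argues that gpt-4 shows no difference in results because the two models differ little at the foundation level, but...

(The paper's argument on this point is somewhat unconvincing.)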

5. Conclusion

Limitation

•

Includes only a limited number of simple indicators.

•

Studies the relationship between indicator values and inference loss only on a fixed base model, LLAMA-7B.

•

Is inference loss really the best quality metric?