Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations

0. Abstract

•

In the in-context learning setting, when several features in the input are equally predictive of the label, is the LLM biased toward preferring one particular feature?

•

If the LLM does have a feature bias, how far can an inductive bias supplied at inference time correct that prior bias?

1. Introduction

•

A limitation of ICL is that the task is solved from only a limited number of demonstrations, so it is hard to tell which feature in the input the model relies on when it relates inputs to labels. (= the task is underspecified by the demonstrations alone)

•

When the topic feature and the sentiment feature in the demonstrations both map to the same label (= 1), it is unclear which feature the model will use to decide the answer for a test sample.

(In a realistic setting, many more features are likely to behave this way.)

•

Therefore, to solve a given task well with ICL, either

(a) the LLM must already have an inductive bias suited to that task, or

(b) an inductive bias that specifies the task must be imposed after the fact.

•

The paper analyzes (a) first, and then analyzes (b) by applying different inductive biases independently, one at a time, and measuring how the LLM's behavior changes.

2. Set Up

Preliminary

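→ In ICL, model predictions shift substantially with the label distribution of the demonstrations (majority bias, recency bias).

→ Therefore, to examine the relationship between a spurious feature and the label rigorously, it makes experimental sense to prepare an input distribution that is not skewed toward one label (presumably a particular label, e.g. positive, would otherwise co-occur more often with the intended feature) and that covers the feature values uniformly.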

Main Setting

•

The basic experimental setup is text classification, and the input text is assumed to have only two binary features (positive=1/negative=0, movie=1/food=0).

•

For each feature there is a function that returns its value for an input text (positive/negative: h1, movie/food: h2).

•

The paper's primary goal is to identify which feature the model relies on to predict the label (1 or 0) in the ICL setting, so the following two datasets are defined first.

◦

underspecified demonstrations: input texts x whose two feature values agree (h1(x) = h2(x) ∈ {0, 1}; labels are balanced)

⇒ positive & movie | negative & food

◦

disambiguation data: input texts x whose two feature values disagree (h1(x) ≠ h2(x); labels are balanced)

⇒ positive & food | negative & movie

•

The learning algorithm l maps a dataset D (which contains many inputs x) to a classifier f, i.e. f = l(D).

⇒ Intuition: prompt the model with underspecified demonstrations and predict on disambiguation data. Comparing the prediction f(x), obtained from P(y | disambiguation datum, underspecified demonstrations), with h1(x) and h2(x) shows which feature the model mainly relied on.

•

Instance template: “Input: $x Label: ”

•

Label verbalizer as v(0) = “0” and v(1) = “1”

** Prompt Example

⇒ Input: positive & movie Label: 1 Input: negative & food Label: 0 Input: positive & food Label: ???

⇒ Input: positive & movie Label: 1 Input: negative & food Label: 0 Input: negative & movie Label: ???
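A minimal sketch of how such a prompt could be assembled from the instance template and the label verbalizer. The names `format_instance` and `build_prompt` are hypothetical, and joining instances with a newline is an assumption; the notes only give the template and v(0), v(1).

```python
from typing import List, Tuple

# Verbalizer from the notes: v(0) = "0", v(1) = "1".
VERBALIZER = {0: "0", 1: "1"}


def format_instance(x: str, label: str = "") -> str:
    """Fill the instance template "Input: $x Label: "."""
    return f"Input: {x} Label: {label}"


def build_prompt(demos: List[Tuple[str, int]], test_x: str) -> str:
    """Concatenate labeled demonstrations, then the unlabeled test input.

    The newline separator is an assumption made for this sketch.
    """
    lines = [format_instance(x, VERBALIZER[y]) for x, y in demos]
    lines.append(format_instance(test_x))  # label left blank for the model
    return "\n".join(lines)


# Underspecified demonstrations (h1 == h2), disambiguation test input (h1 != h2).
demos = [("positive & movie", 1), ("negative & food", 0)]
prompt = build_prompt(demos, "positive & food")
print(prompt)
```

The model's continuation after the final "Label: " is then read as its prediction f(x).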

•

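To quantify this, the following metrics are defined

→ h1 acc + h2 acc = 1 (on disambiguation data h1(x) ≠ h2(x), so each binary prediction agrees with exactly one of the two)

→ Feature bias: the difference between h1 acc and h2 acc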
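The metric can be sketched in a few lines. This is an illustrative version, not the paper's code: the toy feature functions `h1` and `h2` are hypothetical keyword checks, and taking the signed difference h1 acc minus h2 acc as the bias is an assumption about the sign convention.

```python
from typing import Callable, Dict, List

Feature = Callable[[str], int]


def h_accuracy(preds: List[int], xs: List[str], h: Feature) -> float:
    """Fraction of predictions that agree with the feature function h."""
    return sum(int(p == h(x)) for p, x in zip(preds, xs)) / len(xs)


def feature_bias(preds: List[int], xs: List[str], h1: Feature, h2: Feature) -> Dict[str, float]:
    """h-accuracies on disambiguation data and their signed difference."""
    acc1 = h_accuracy(preds, xs, h1)
    acc2 = h_accuracy(preds, xs, h2)
    return {"h1_acc": acc1, "h2_acc": acc2, "bias": acc1 - acc2}


# Hypothetical toy feature functions: keyword checks on the input text.
def h1(x: str) -> int:
    return int("positive" in x)


def h2(x: str) -> int:
    return int("movie" in x)


# Disambiguation inputs (h1 != h2) and made-up model predictions.
xs = ["positive & food", "negative & movie", "positive & food", "negative & movie"]
preds = [1, 0, 1, 1]

result = feature_bias(preds, xs, h1, h2)
print(result)  # {'h1_acc': 0.75, 'h2_acc': 0.25, 'bias': 0.5}
```

A bias near 0 means the model shows no clear preference; a value near +1 or -1 means its predictions follow h1 or h2 almost exclusively.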

3. Data Construction

•

Datasets are constructed for four NLP tasks

◦

The original label (e.g., positive/negative in sentiment classification) is used as the default feature

◦

Spurious features are chosen using prior work, metadata, and so on (→ each is converted into a binary classification)

•

Sentiment Analysis (IMDb, Yelp)

•

Toxicity Classification (CivilComments)

•

NLI (MNLI)

•

QA (BoolQ)

4. Experiments

Experiment Setup & Models

•

Models

◦

Davinci (GPT-3)

◦

Text-davinci-002 (instruction-tuned)

•

Evaluation Protocols

◦

16 shots

▪

h1(x)=h2(x)=0 (8 shots) & h1(x)=h2(x)=1 (8 shots)

▪

3 random seeds, each evaluated on 1,200 disambiguation test examples (h1(x) = 0, h2(x) = 1 and h1(x) = 1, h2(x) = 0)
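A rough sketch of this protocol: split a pool of inputs by whether the two features agree, then draw a label-balanced 16-shot demonstration set per seed. The synthetic pool, the keyword-based `h1`/`h2`, and the shuffling of demonstration order are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

Feature = Callable[[str], int]


def split_pool(pool: List[str], h1: Feature, h2: Feature) -> Tuple[List[str], List[str]]:
    """Separate inputs where the features agree from those where they disagree."""
    underspecified = [x for x in pool if h1(x) == h2(x)]
    disambiguation = [x for x in pool if h1(x) != h2(x)]
    return underspecified, disambiguation


def sample_demonstrations(
    underspecified: List[str], h1: Feature, n_shots: int = 16, seed: int = 0
) -> List[Tuple[str, int]]:
    """Draw n_shots // 2 demonstrations per label; here label = h1(x) = h2(x)."""
    rng = random.Random(seed)
    per_label = n_shots // 2
    demos: List[Tuple[str, int]] = []
    for label in (0, 1):
        candidates = [x for x in underspecified if h1(x) == label]
        demos += [(x, label) for x in rng.sample(candidates, per_label)]
    rng.shuffle(demos)  # assumption: demonstration order is randomized
    return demos


# Hypothetical toy feature functions and a synthetic pool of 80 inputs.
def h1(x: str) -> int:
    return int("positive" in x)


def h2(x: str) -> int:
    return int("movie" in x)


pool = [
    f"{s} & {t} #{i}"
    for s in ("positive", "negative")
    for t in ("movie", "food")
    for i in range(20)
]
underspecified, disambiguation = split_pool(pool, h1, h2)
runs = {seed: sample_demonstrations(underspecified, h1, seed=seed) for seed in (0, 1, 2)}
```

Each entry of `runs` would be turned into a prompt and scored against the disambiguation set, with results averaged over the seeds.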

Without Inductive Bias

•

In sentiment analysis, the bias toward the semantic (sentiment) feature is strong

•

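In toxicity classification, neither davinci nor the instruction-tuned model shows a particular feature bias

= when the model predicts labels from the demonstrations, it is unclear which feature influences it more

(Given the nature of the data, this may not be surprising..? toxicity & gender & LGBTQ & muslim & uppercase → 1 & 0)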

•

On MNLI and BoolQ, the instruction-tuned LLM starts to prefer the feature associated with the label

•

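The instruction-tuned model tends to prefer the label-intended feature more

→ The authors attribute this to the model having seen similar problems during instruction tuning (from the model's point of view, the task is not underspecified)

⇒ When the LLM's feature bias does not match the task to be solved, performance drops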

With Inductive Bias

: Next, check whether supplying different kinds of inductive bias changes the prior feature bias.

•

Setting

◦

Baseline: simply concatenate demonstration examples as the prompt, and use “1” and “0” as the verbalizer. (same as before)

⇒ Each of the following is added to the baseline one at a time, and the change in h-accuracy is measured

◦

Semantic verbalizer: the labels are replaced with natural-language labels for the targeted feature only

(targeting h1: 1→positive, 0→negative)

(targeting h2: 1→movie, 0→food)

Verbalizer Examples

◦

Instruction setting: an instruction related to the targeted feature is added at the start of the prompt

Instruction Examples

◦

Explanation: a rationale related to the targeted feature is inserted between the input and the label

Explanation Example

◦

Disambiguation setting: half of the demonstrations are replaced with disambiguation examples

(to steer the model toward h1, replace them with samples where h1(x) ≠ h2(x) and h1(x) = y (positive = 1 & food = 0))

⇒ This is a device that makes the task specified (it supplies more label-intended signal)
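The four interventions can be sketched as small modifications of the baseline prompt builder. Everything below is illustrative: the instruction and explanation strings are made up, the "Explanation:" field name and the way the test instance ends are assumptions, and `with_disambiguation` keeps the first half of the demonstrations for simplicity rather than preserving label balance.

```python
from typing import Callable, Dict, List, Optional, Tuple

Demo = Tuple[str, int]

DEFAULT_VERBALIZER: Dict[int, str] = {1: "1", 0: "0"}
# Semantic verbalizers from the notes, keyed by the targeted feature.
SEMANTIC_VERBALIZERS: Dict[str, Dict[int, str]] = {
    "h1": {1: "positive", 0: "negative"},
    "h2": {1: "movie", 0: "food"},
}


def build_prompt(
    demos: List[Demo],
    test_x: str,
    verbalizer: Optional[Dict[int, str]] = None,
    instruction: Optional[str] = None,
    explanations: Optional[List[str]] = None,
) -> str:
    """Baseline prompt plus optional verbalizer / instruction / explanation."""
    v = verbalizer or DEFAULT_VERBALIZER
    blocks: List[str] = [instruction] if instruction else []
    for i, (x, y) in enumerate(demos):
        text = f"Input: {x} "
        if explanations is not None:
            text += f"Explanation: {explanations[i]} "  # rationale before the label
        blocks.append(text + f"Label: {v[y]}")
    # Assumption: with explanations the model writes its rationale first.
    tail = "Explanation: " if explanations is not None else "Label: "
    blocks.append(f"Input: {test_x} " + tail)
    return "\n".join(blocks)


def with_disambiguation(
    demos: List[Demo], disamb_xs: List[str], target_h: Callable[[str], int]
) -> List[Demo]:
    """Replace half of the demos with disambiguation inputs labeled by target_h."""
    half = len(demos) // 2
    replacements = [(x, target_h(x)) for x in disamb_xs[: len(demos) - half]]
    return demos[:half] + replacements


def h1(x: str) -> int:  # hypothetical keyword check for sentiment
    return int("positive" in x)


demos = [("positive & movie", 1), ("negative & food", 0)]
p_verbalizer = build_prompt(demos, "positive & food", verbalizer=SEMANTIC_VERBALIZERS["h2"])
p_instruction = build_prompt(demos, "positive & food", instruction="Classify the sentiment of the input.")
p_explanation = build_prompt(demos, "positive & food", explanations=["The tone is positive.", "The tone is negative."])
mixed = with_disambiguation(demos, ["positive & food"], h1)
```

Keeping each intervention as an independent optional argument mirrors the setup in the notes, where one intervention at a time is added to the baseline.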

•

Overall Result

(Sentiment analysis is excluded from this experiment because the model already picks up the semantic feature well)

•

For the model without instruction tuning, explanations and semantic verbalizers help shift it further toward the targeted feature.

•

For the instruction-tuned model, (unsurprisingly) instructions help steer it toward the targeted feature.

•

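Both models pick up, through the demonstrations, a bias from the overall input distribution (toward whichever of h1 or h2 appears more often)

⇒ Supplying an inductive bias through the prompt is more effective than supplying unambiguous demonstration examples to skew the distribution artificially.

(In a real-world setting, giving a label-intended intervention directly is a more practical solution than knowing every spurious feature in advance and designing the demonstrations around them)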

•

ICL under underspecified demonstrations (all the more so without an inductive bias):

⇒ ICL might work in part by recognizing existing tasks rather than directly learning the input-output relation from the demonstration.

When are interventions effective in Text-Davinci-002

•

When the LLM already has a strong preference for a feature (intended or spurious), an inductive bias toward that feature makes ICL rely on it even more exclusively when predicting.

•

Interventions are effective when the model has a low feature bias. (Toxicity Classification)

(When the model shows no strong preference for either feature, adding an inductive bias makes ICL rely on the targeted feature when predicting)

•

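Interventions cannot override a strong prior feature bias in the LLM. (NLI, QA, and sentiment already have a prior toward h1: steering toward h1 works, steering toward h2 does not)

(When the model prefers the intended feature over the spurious feature, an inductive bias that tells it to focus on the spurious feature cannot override that preference.)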

•

A measurement on the MNLI dataset of which spurious feature (h2) yields the larger gain when the intervention targets it

→ (Unsurprisingly) the genre feature responds better.

To summarize,

•

Without instruction tuning, the LLM's feature preference has to be controlled through demonstrations rather than instructions.

•

With an instruction-tuned model, it may be possible to control spurious features through inductive biases.

•

For features unrelated to semantics (overlap, length, ...), there is no consensus on which intervention should be used to control them

5. Conclusion

•

InstructGPT model prefers the “default” task features over distractor features more often than the base GPT-3 model

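⇒ Various interventions can, to some extent, control the model's indiscriminate use of features.

** In many cases the demonstrations will not be underspecified; resolving the many other biases that arise in those cases may be the real future work.

(e.g., in specialized domains such as case-law analysis for lawyers or patent filing and similarity checks for patent attorneys, an instruction + ICL combination will likely be used. Underspecified demonstrations seem unlikely in such real-world LLM use, so this could be a topic for future work)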