Mitigating Label Biases for In-context Learning

Introduction

→ In-context learning (ICL) with large language models is highly sensitive to:

• choice of demonstrations

• order of demonstrations

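- Example: ICL on SST-2 with 4 demonstrations, where the model should predict Positive

The model here is GPT-3 2.7B, which is a limitation in itself, but the paper points out that a larger model does not substantially reduce this sensitivity.

Ground truth: Positive (each row: labels of the 4 demonstrations in order → model prediction; P = Positive, N = Negative)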

PPPP → P

NNPP → P

PPNN → N

NNNP → P

PPPN → N

NNNN → N
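To probe this kind of order sensitivity, one can enumerate every distinct label ordering of a demonstration set and build one prompt per ordering. The sketch below is only an illustration: the `Review:` / `Sentiment:` template and the helper names `label_orderings` and `build_prompt` are hypothetical and do not come from the paper.

```python
from itertools import permutations


def label_orderings(labels):
    """Return the distinct orderings of a multiset of demonstration labels."""
    return sorted(set("".join(p) for p in permutations(labels)))


def build_prompt(demos, query):
    """Format (text, label) demonstrations followed by the unanswered query.

    The template is a hypothetical one chosen for illustration.
    """
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Two positive and two negative demonstrations give 6 distinct label orders.
orders = label_orderings("PPNN")

# A toy prompt with two demonstrations and one query.
prompt = build_prompt([("great film", "Positive"), ("awful film", "Negative")], "fine film")
```

Feeding each ordering to the model and comparing the predictions gives rows like the ones listed above.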

Categorizing Label Biases in ICL

• Vanilla-label bias: bias toward label words that appeared frequently during pre-training (uncontextual, i.e. independent of the prompt)

• Context-label bias: bias caused by the influence of the context prompt

◦ order (recency)

◦ task template

[verbalizer (input + label → free-form text), attention distribution]

• Domain-label bias

"Suppose a doctor decides, based on a medical certificate, whether a patient is (1) sick or (2) healthy."

Sick patients are generally more likely to request a medical certificate, so the certificate will be highly correlated with (1).

→ A spurious correlation arises between "words that appear in the certificate" and "(1) sick".

- When classifying domain-specific data, as in a hate speech detection task, the model shows no bias when given random English words, but it does show bias when given in-domain words (words sampled from the dataset).

→ The paper devises a metric to quantify domain-label bias

= The larger the value, the higher the bias (compared with random English words, in-domain words push predictions toward a particular class for no good reason)

→ Larger LLMs cope even less well with domain-label bias (smaller models actually seem to cope better…?)
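The notes do not record the metric's exact formula. A minimal sketch, assuming it is the total variation distance between the average label distribution predicted for random English words and the one predicted for random in-domain words (the function names and toy numbers are illustrative, not from the paper):

```python
def mean_distribution(distributions):
    """Average a list of per-example label distributions, class by class."""
    n = len(distributions)
    k = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(k)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def domain_label_bias(random_english_preds, random_in_domain_preds):
    """Assumed metric: distance between the two averaged label distributions."""
    p_eng = mean_distribution(random_english_preds)
    p_ind = mean_distribution(random_in_domain_preds)
    return total_variation(p_eng, p_ind)


# Toy numbers: predictions on random English words are near uniform,
# predictions on random in-domain words lean heavily toward class 0.
english = [[0.5, 0.5], [0.6, 0.4]]
in_domain = [[0.9, 0.1], [0.8, 0.2]]
bias = domain_label_bias(english, in_domain)
```

Under this assumption the value is 0 when both kinds of random text give the same label distribution and grows as in-domain words pull predictions toward one class.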

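#### Previously proposed Contextual Calibration:

1st demonstration 1st answer

2nd demonstration 2nd answer

N/A

Fit a weight such that the output becomes (0.5, 0.5) when the prompt above, with the content-free input N/A, is fed to the model

• Vanilla-label bias: keeps the model from indiscriminately pulling out pre-training knowledge

• Context-label bias: keeps the model from indiscriminately exploiting the context

→ Calibrating with N/A does not account for labels being skewed depending on the domain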
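A minimal sketch of this idea, assuming the weight is the diagonal matrix W = diag(p_cf)^-1 built from the label distribution p_cf that the model outputs for the content-free input, and that scores are renormalized by simple division rather than a softmax (both choices are assumptions of this sketch):

```python
def contextual_calibration(p_content_free, p_test):
    """Rescale test probabilities by the inverse content-free probabilities.

    Equivalent to applying W = diag(p_content_free)^-1 and renormalizing.
    Assumes every entry of p_content_free is strictly positive.
    """
    scaled = [p / cf for p, cf in zip(p_test, p_content_free)]
    total = sum(scaled)
    return [s / total for s in scaled]


# The model leans toward class 0 even for the content-free input "N/A".
p_cf = [0.7, 0.3]

# By construction, the content-free input itself is mapped to (0.5, 0.5).
neutral = contextual_calibration(p_cf, p_cf)

# A test input scored (0.6, 0.4) flips to class 1 after calibration.
calibrated = contextual_calibration(p_cf, [0.6, 0.4])
```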

Domain-context Calibration

• Use "content-free example text" to calibrate probabilities that are excessively skewed toward the domain!

Build a bag of words from the test set

If the unlabeled texts have an average length of L, sample L random words from the bag of words.
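The sampling step can be sketched as follows. Whitespace tokenization, sampling with replacement, and rounding the average length are assumptions made for illustration; the notes do not specify them.

```python
import random


def build_bag_of_words(unlabeled_texts):
    """Collect every whitespace-separated token, keeping duplicates."""
    bag = []
    for text in unlabeled_texts:
        bag.extend(text.split())
    return bag


def average_length(unlabeled_texts):
    """Average number of tokens per text, rounded to an integer."""
    total = sum(len(text.split()) for text in unlabeled_texts)
    return round(total / len(unlabeled_texts))


def sample_random_text(bag, length, rng):
    """Draw `length` words from the bag, with replacement, as one string."""
    return " ".join(rng.choice(bag) for _ in range(length))


texts = ["the movie was great", "terrible plot and bad acting", "i loved it"]
bag = build_bag_of_words(texts)
L = average_length(texts)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_text = sample_random_text(bag, L, rng)
```

Because duplicates stay in the bag, frequent in-domain words are sampled more often, which keeps the word distribution of the random text close to that of the dataset.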

Compute the prior below (the random text consists of words that could belong to any class and is not grammatically meaningful, which makes it suitable for calibration)

- M = 20
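The formula itself is not recorded in these notes. A plausible reconstruction, assuming M is the number of sampled random texts and C is the prompt (the demonstrations), averages the model's label probabilities over the M random texts:

```latex
\bar{P}(y \mid C) = \frac{1}{M} \sum_{j=1}^{M} P\big(y \mid [\text{random text}]_j,\, C\big)
```

Here $[\text{random text}]_j$ denotes the j-th sequence of L words sampled from the bag of words; this notation is introduced only for this sketch.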

With C denoting the prompt (the demonstrations), run ICL with the calibrated probability below
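The calibrated formula is also missing from these notes. A minimal sketch, assuming the calibrated score divides the model's probability for a test input by the prior estimated from random in-domain texts and the prediction is the argmax of that ratio (function names and numbers are illustrative):

```python
def estimate_prior(random_text_preds):
    """Average the label distributions predicted for the M random texts."""
    m = len(random_text_preds)
    k = len(random_text_preds[0])
    return [sum(p[i] for p in random_text_preds) / m for i in range(k)]


def calibrated_prediction(p_test, prior):
    """Divide each class probability by its prior and take the argmax.

    Assumes every prior entry is strictly positive.
    """
    scores = [p / q for p, q in zip(p_test, prior)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores


# Toy predictions for M = 3 random in-domain texts: the model favors class 0
# even though the texts carry no meaningful content.
random_text_preds = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]]
prior = estimate_prior(random_text_preds)

# Uncalibrated, this test input would be assigned class 0.
label, scores = calibrated_prediction([0.7, 0.3], prior)
```

In this toy case the test input's probability for class 0 is below the prior for class 0, so the calibrated prediction switches to class 1.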

Experimental Setting & Result

• Dataset: 24 text classification datasets

• Backbone: GPT-J & GPT-3 (175B)

• Shots: 8 (5 different random seeds)

• Baselines

◦ random performance

◦ uncalibrated performance

◦ contextual calibration performance

• Metric: Macro-F1

Main Result

#### Domain-context calibration (DC) generally improves in-context learning, especially on tasks with large domain-label bias (Poem, Finance, Tweet, etc.).

#### On all datasets, DC consistently boosts the performance of both models, with an average improvement (Macro-F1) of 20% (GPT-J) and 18% (GPT-3).

#### On the datasets where GPT-J showed high domain-label bias (the 3 Tweet datasets), DC follows a scaling law with lower variance than the baselines

#### Adding more shots helps DC more than it helps the other methods, but not by much..?

#### DC also outperforms the baselines when GPT-3 (175B) or Instruct-GPT3 (text-davinci-002) performs ICL with instructions (average over the 3 Tweet datasets)

Analysis

#### Calibrating with random texts of the average input length is beneficial (it seems to remove the bias that comes from the context of the input text itself)

→ context prompt bias