Explanation-based Finetuning Makes Models More Robust to Spurious Cues

1. Introduction

#### Spurious Feature

: A feature that is correlated with the label (offensive) but has no causal relationship to it (e.g., @username)

•

The problem is that the model learns the spurious feature-label association during training, and this association then surfaces at inference time.

•

Existing methods for reducing an LLM's reliance on spurious features either

◦

modify the model architecture, or

◦

adjust the dataset (e.g., augmentation).

→ However, these methods need prior knowledge of the spurious feature to deliver meaningful gains.

•

Therefore, this paper proposes explanation-based fine-tuning: a feature-agnostic method that reduces an LLM's reliance on spurious features without any prior information about which features are spurious.

•

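If a model is trained to first generate an explanation (supporting the gold label) and then the answer, it keeps this behavior at inference. Compared with a model not fine-tuned with explanations, this gave higher accuracy and a lower correlation between the predicted label and the spurious feature across four datasets.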

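(In fact, a nearly identical study had already been published, yet this paper still got accepted, which shows how much it matters to design experiments well and to present them convincingly.)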

2. Problem Definition

Why Are Spurious Features a Problem, and What Should We Focus On?

•

During training, the spurious correlation is strong,

•

but at test time it does not hold.

•

Fine-tuning is seen as a potential fix, but the problem is that

•

we cannot know in advance which spurious features will appear in the training data, so a feature-agnostic fine-tuning method is needed.

Specific Scope

•

Binary Classification

•

Generative LLM

3. Dataset Setup & Method

Like Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations, this work carefully designs settings in which spurious features actually cause problems.

Dataset Setup

Given all training instances, with negative label L0 and positive label L1:

Build D^f_{train}, a dataset skewed toward the spurious feature, by selectively sampling only instances with positive label L1 and feature f+ (500) and instances with negative label L0 and feature f- (500).

(i.e., a dataset whose label-feature MCC is 1.0)

Build D_{train} by sampling 1,000 instances from the original training distribution (1:1 label ratio).

Build D_{test} by sampling 500 instances from the original test distribution (1:1 label ratio).
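A minimal sketch of the three splits, assuming every example already carries a binary label and a binary spurious-feature flag. The `Example` class and the function names are hypothetical, and sampling without replacement with a fixed seed is my assumption, not the paper's released code.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    label: int     # 1 = L1 (positive), 0 = L0 (negative)
    feature: bool  # True = f+ (spurious feature present), False = f-


def build_skewed_train(pool, n_per_cell=500, seed=0):
    """D^f_train: only (L1, f+) and (L0, f-) examples, so label == feature (MCC 1.0)."""
    rng = random.Random(seed)
    pos_fplus = [ex for ex in pool if ex.label == 1 and ex.feature]
    neg_fminus = [ex for ex in pool if ex.label == 0 and not ex.feature]
    data = rng.sample(pos_fplus, n_per_cell) + rng.sample(neg_fminus, n_per_cell)
    rng.shuffle(data)
    return data


def build_balanced(pool, n_total, seed=0):
    """D_train / D_test: label-balanced (1:1) random sample; the feature is ignored."""
    rng = random.Random(seed)
    pos = [ex for ex in pool if ex.label == 1]
    neg = [ex for ex in pool if ex.label == 0]
    data = rng.sample(pos, n_total // 2) + rng.sample(neg, n_total // 2)
    rng.shuffle(data)
    return data
```

`build_skewed_train(train_pool)` gives the 1,000-example D^f_{train}, `build_balanced(train_pool, 1000)` gives D_{train}, and `build_balanced(test_pool, 500)` gives D_{test}. Because D_{train} is sampled without looking at the feature, any label-feature correlation in it is only whatever the original data already has.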

Terminology & Metric

Fine-tune PLM ‘M’ with method FT on D_{train}, then evaluate on D_{test} → acc(M^(FT)_{base})

Fine-tune PLM ‘M’ with method FT on D^f_{train}, then evaluate on D_{test} → acc(M^(FT)_{f})

MCC between the predicted labels and the feature labels (f+ / f-) when the model is trained on D^f_{train} and evaluated on D_{test} → corr(M^(FT)_{f})

#### Core Metric

•

delta_{acc} of M,FT = acc(M^(FT)_{base}) - acc(M^(FT)_{f})

•

corr(M^(FT)_{f})

→ FT1 is more robust to feature f than FT2 when delta_{acc} of M,FT1 < delta_{acc} of M,FT2 and corr(M^(FT1)_{f}) < corr(M^(FT2)_{f}), i.e., FT1 suffers a smaller accuracy drop and its predictions follow the feature less.
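The two core metrics can be sketched in plain Python, assuming predictions, labels, and feature flags are all encoded as 0/1. The function names are hypothetical; returning 0.0 when the MCC denominator is zero is a common convention, not something stated in the paper.

```python
import math


def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def mcc(xs, ys):
    """Matthews correlation coefficient between two binary (0/1) sequences."""
    tp = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    tn = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    fp = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    fn = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def delta_acc(acc_base, acc_f):
    """Accuracy drop from training on D^f_train instead of D_train."""
    return acc_base - acc_f


def more_robust(delta1, corr1, delta2, corr2):
    """FT1 beats FT2: smaller accuracy drop AND lower prediction-feature MCC."""
    return delta1 < delta2 and corr1 < corr2
```

To get corr(M^(FT)_{f}), pass the model's 0/1 predictions on D_{test} and the 0/1 feature flags of the same D_{test} examples to `mcc`.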

Fine-tuning Methods

•

Standard FT

•

Explanation-based FT
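The difference between the two methods is only in the training target. Below is a hedged sketch using the CREAK example train_1423 from these notes; the prompt template, the "So the answer is" phrasing, and the parser are all hypothetical choices of mine, since the paper's exact format is not given here.

```python
# Hypothetical prompt template; the paper's exact format is not given in these notes.
PROMPT_TEMPLATE = "Claim: {sentence}\nAnswer:"


def standard_ft_pair(sentence, label):
    """Standard FT: the target is the label only."""
    return PROMPT_TEMPLATE.format(sentence=sentence), f" {label}"


def explanation_ft_pair(sentence, explanation, label):
    """Explanation-based FT: the target is the explanation first, then the label."""
    target = f" {explanation} So the answer is {label}."
    return PROMPT_TEMPLATE.format(sentence=sentence), target


def parse_label(completion, labels=("true", "false")):
    """Read the label from a generated completion; the last label word wins."""
    words = [w.strip(".,").lower() for w in completion.split()]
    for w in reversed(words):
        if w in labels:
            return w
    return None


sentence = "Lauryn Hill separates two valleys as it is located between them."
explanation = "Lauren Hill is actually a person and [not] a mountain."
prompt, target = explanation_ft_pair(sentence, explanation, "false")
print(parse_label(target))  # false
```

Scanning from the end lets the parser ignore label words that happen to appear inside the explanation, which matters because the answer comes after the explanation.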

4. Experimental Results

Dataset

•

CREAK (Onoe et al., 2021) Given a claim, the task is to verify whether it is True (L1) or False (L0). 

•

e-SNLI (Camburu et al., 2018) Given a premise and a hypothesis, the task is to decide whether it is True (L1) or False (L0) that the premise entails the hypothesis.

•

ComVE (Wang et al., 2019) Given two sentences, the task is to judge which one of Sentence 1 (L1) or Sentence 2 (L0) is more plausible. 

•

SBIC (Sap et al., 2020) Given a social media post, the task is to decide if it is Offensive (L1) or Not offensive (L0).

Spurious Cues

•

Sentence Length: For inputs longer than a fixed length threshold, we consider the feature to be present (f+).

•

Present Tense: If the POS tag of the first verb is VBP (present tense verb) or VBZ (present 3rd person singular), we consider the feature to be present (f+). 

•

Plural Noun: If the POS tag of the first noun is NNS (noun plural) or NNPS (proper noun plural), we consider the feature to be present (f+).

•

Embedding Cluster: We cluster the training set to assign inputs to two clusters, arbitrarily indexed as C0 and C1. If an input falls in cluster C0, we consider the feature to be present (f+). Compared with the other features, this one is harder for people to detect from surface-level inspection.
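A sketch of the four cue detectors, assuming inputs are already POS-tagged as (word, Penn Treebank tag) pairs, which is the format `nltk.pos_tag` returns, and already embedded as vectors. Measuring length in tokens, treating inputs with no verb/noun as f-, and using a plain 2-means for the embedding clusters are my assumptions; the paper's tagger, embedding model, and clustering algorithm are not specified in these notes.

```python
import math
import random


def length_feature(tokens, threshold):
    """f+ if the input is longer than the threshold (measured here in tokens)."""
    return len(tokens) > threshold


def present_tense_feature(tagged):
    """f+ if the first verb is tagged VBP or VBZ; `tagged` is [(word, tag), ...]."""
    for _, tag in tagged:
        if tag.startswith("VB"):
            return tag in ("VBP", "VBZ")
    return False  # no verb at all: treat as f-


def plural_noun_feature(tagged):
    """f+ if the first noun is tagged NNS or NNPS."""
    for _, tag in tagged:
        if tag.startswith("NN"):
            return tag in ("NNS", "NNPS")
    return False  # no noun at all: treat as f-


def two_means(vectors, iters=20, seed=0):
    """Plain 2-means clustering; returns a cluster index (0 = C0, 1 = C1) per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, 2)
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        assign = [min((0, 1), key=lambda c: math.dist(v, centroids[c])) for v in vectors]
        # Move each non-empty cluster's centroid to the mean of its members.
        for c in (0, 1):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign


def embedding_cluster_feature(cluster_index):
    """f+ if the input falls in cluster C0 (the index itself is arbitrary)."""
    return cluster_index == 0
```

For "Many countries have outlawed oxygen therapy.", tagged as `[("Many", "JJ"), ("countries", "NNS"), ("have", "VBP"), ...]`, both the present-tense and plural-noun cues fire. Since C0 vs. C1 is an arbitrary index, the embedding cue has no human-readable meaning, which matches the note that it is hard to spot by inspection.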

Main Result

•

In the D_{train} → D_{test} setting, explanation-based FT is slightly worse than standard FT, but in the D^f_{train} → D_{test} setting it generally gives large gains.

•

ComVE shows that explanation-based tuning does not always guarantee higher absolute accuracy.

•

Prediction-feature correlation is reduced to -0.217, compared with 0.167 for standard FT.

Analysis

•

Explanation-based FT is effective only when the model is prone to picking up the spurious feature in the first place (i.e., high prediction-feature correlation under standard FT), as in CREAK and e-SNLI but not ComVE.

•

In other words, on datasets where standard FT already leaves the model largely insensitive to the spurious feature, explanation-based tuning may even be counterproductive, and this effect may be stronger for smaller models.

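(The model already ignores the spurious feature, yet it must additionally solve the explanation-generation task, whose purpose is to keep it from reacting to the spurious feature and which may conflict with generating the answer.)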

Further Analysis

#### Do explanations improve the robustness of models of different sizes and families?

•

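Across the four GPT-3 models (ada, babbage, curie, davinci), the larger the model, the stronger the accuracy gain and correlation reduction from explanations when the spurious feature is introduced.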

•

For ada, explanation-based FT reportedly drops 13.2 accuracy points relative to standard FT on the no-cue dataset.

#### How does the spurious correlation strength affect our method?

→ Experiments varying the correlation strength of D^f_{train}

•

Explanations only become effective once the correlation is 0.8 or higher.

⇒ The method should be used in settings where spurious features are actually a problem.
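One way to dial in a given correlation strength, assuming "correlation" means the label-feature MCC of D^f_{train} (as with the 1.0 MCC setting) and that labels and features stay balanced: put a fraction p of the data in the aligned cells (L1, f+) and (L0, f-) and split the rest evenly between the misaligned cells. Then MCC = p^2 - (1 - p)^2 = 2p - 1, so the 0.8 threshold corresponds to p = 0.9. This construction and the helper names are hypothetical; these notes do not say how the paper sampled the intermediate settings.

```python
import math


def cell_counts(n_total, p_aligned):
    """Split a label- and feature-balanced set of n_total examples into the four
    (label, feature) cells, with a fraction p_aligned in the aligned cells."""
    aligned = round(n_total * p_aligned / 2)
    misaligned = n_total // 2 - aligned
    return {"L1_f+": aligned, "L0_f-": aligned, "L1_f-": misaligned, "L0_f+": misaligned}


def label_feature_mcc(counts):
    """MCC between label and feature, treating f+ as the 'predicted positive'."""
    tp, tn = counts["L1_f+"], counts["L0_f-"]
    fp, fn = counts["L0_f+"], counts["L1_f-"]
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def p_for_mcc(target_mcc):
    """Invert MCC = 2p - 1 for the balanced construction above."""
    return (1 + target_mcc) / 2


counts = cell_counts(1000, p_for_mcc(0.8))
print(counts)  # {'L1_f+': 450, 'L0_f-': 450, 'L1_f-': 50, 'L0_f+': 50}
```

With 1,000 examples, an MCC of 0.8 still means only 100 examples break the spurious pattern, which gives a sense of how skewed the data must be before explanations start to help.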

#### Does explanation quality affect the effectiveness of our method?

→ For CREAK and e-SNLI, fine-tune using explanations taken from different instances that have the same label

(To analyze the impact of explanation quality in our setting)

•

Even training with permuted explanations beats training with no explanations → The authors suggest that explanations help the model forget the spurious feature, but leave the precise reason to future work.
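The permuted-explanation control can be sketched as a within-label shuffle. Records are assumed to be dicts with `label` and `explanation` keys, as in the CREAK format; the function name is hypothetical, and using a cyclic shift so that no example keeps its own explanation is my assumption about what "a different instance" means.

```python
import random
from collections import defaultdict


def permute_explanations(examples, seed=0):
    """Give each example the explanation of another example with the same label.
    Returns new dicts; the input records are not modified."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, ex in enumerate(examples):
        by_label[ex["label"]].append(i)
    out = [dict(ex) for ex in examples]
    for idxs in by_label.values():
        if len(idxs) < 2:
            continue  # no other instance with this label to borrow from
        order = idxs[:]
        rng.shuffle(order)
        # Shift the shuffled order by one: every example receives its
        # predecessor's explanation, so none keeps its own.
        for src, dst in zip(order, order[1:] + order[:1]):
            out[dst]["explanation"] = examples[src]["explanation"]
    return out
```

Each explanation is reused exactly once within its label group, so the label-level distribution of explanations (including patterns like "not") is preserved while the instance-level link is broken.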

## CREAK Dataset

{
    'ex_id': 'train_1423',
    'sentence': 'Lauryn Hill separates two valleys as it is located between them.',
    'explanation': 'Lauren Hill is actually a person and [not] a mountain.',
    'label': 'false',
    'entity': 'Lauryn Hill',
    'en_wiki_pageid': '162864',
    'entity_mention_loc': [[0, 11]]
}

## DEV SET

{"ex_id": "dev_0", "sentence": "Eating soup means eating only solids.", "explanation": "Soup is mainly liquid sometimes with bits of food inside.", "label": "false", "entity": "Soup", "en_wiki_pageid": "19651298", "entity_mention_loc": [[7, 11]]}
{"ex_id": "dev_1", "sentence": "Many countries have outlawed oxygen therapy.", "explanation": "Oxygen therapy is [not] a controversial treatment.", "label": "false", "entity": "Oxygen therapy", "en_wiki_pageid": "508455", "entity_mention_loc": [[29, 43]]}
{"ex_id": "dev_2", "sentence": "Peter Sellers spent his life as a single bachelor.", "explanation": "He was married four times with three children.", "label": "false", "entity": "Peter Sellers", "en_wiki_pageid": "24518", "entity_mention_loc": [[0, 13]]}
{"ex_id": "dev_3", "sentence": "People use spaghetti to tie items together.", "explanation": "Spaghetti is used for food, [not] to tie things together. Spaghetti is too weak for that.", "label": "false", "entity": "Spaghetti", "en_wiki_pageid": "29178", "entity_mention_loc": [[11, 20]]}
{"ex_id": "dev_4", "sentence": "No voice actors sang in the Beauty and the Beast.", "explanation": "The Beauty and the Beast (1991 film) was [not] a silent motion picture, and the voice acting and musical numbers were well received.", "label": "false", "entity": "Beauty and the Beast (1991 film)", "en_wiki_pageid": "133462", "entity_mention_loc": [[28, 48]]}

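(Personal speculation: This seems consistent with why the authors went for a feature-agnostic method. The explanations in the datasets they used explain why the gold label is correct, not why the spurious feature is irrelevant. So explanations that share a label (e.g., false) probably share some patterns (e.g., somewhat more "not"), and that could have an effect... → Isn't this itself a kind of spurious cue tied to the "false" label?)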
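A quick way to probe this hunch is to measure how often explanations contain a negation, per label. The sketch below uses only the five CREAK dev examples listed above, all of which are labeled false, so it can only show the rate for false; a real check would run over the full training set and compare against true-label explanations. The regex and function name are my own choices.

```python
import re
from collections import defaultdict

# (label, explanation) pairs copied from the CREAK dev examples listed above.
examples = [
    ("false", "Soup is mainly liquid sometimes with bits of food inside."),
    ("false", "Oxygen therapy is [not] a controversial treatment."),
    ("false", "He was married four times with three children."),
    ("false", "Spaghetti is used for food, [not] to tie things together. Spaghetti is too weak for that."),
    ("false", "The Beauty and the Beast (1991 film) was [not] a silent motion picture, and the voice acting and musical numbers were well received."),
]

# Matches the word "not" (including "[not]") or a contracted "n't".
NEGATION = re.compile(r"\bnot\b|n't\b", re.IGNORECASE)


def negation_rate_by_label(pairs):
    """Fraction of explanations per label that contain a negation."""
    hits, totals = defaultdict(int), defaultdict(int)
    for label, explanation in pairs:
        totals[label] += 1
        hits[label] += bool(NEGATION.search(explanation))
    return {label: hits[label] / totals[label] for label in totals}


print(negation_rate_by_label(examples))  # {'false': 0.6}
```

If the rate for false turned out to be much higher than for true on the full data, explanation tokens like "not" would correlate with the label without causing it, which is exactly the kind of shortcut the notes are worried about.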

5. Conclusion

•

The choice of spurious features was rather weak, but otherwise good.

•

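At a time when experimental design matters more and more, there seems to be a lot to learn from this paper, including how its writing is structured.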