Measuring Association Between Labels and Free-Text Rationales

Introduction

→ Model의 decision making이 과연 faithful rationale을 바탕으로 이루어졌는가를 분석한 논문

•

 기존의 INPUT(Question/Hypothesis)안에 표면적으로 존재한 rationale는 OUPUT까지 reasoning 하는데 한계

•

 fill in the gap을 위해 free text rationale를 포함한 dataset이 나옴

→ free text rationale를 활용해 추론하는 방법에는 크게 2가지가 있다.

Pipeline: INPUT(Question & Answer Candidiates) > MODEL 1 > Rationale  && Raitonale & Subset of Input (Answer Candidiates) > MODEL 2 > Answer

[Model2를 sufficiceny라고 하는듯: is sufficient to make a prediction on its own]

Self-rationalizing: INPUT(Question & Answer Candidiates) > MODEL > Rationale & ANSWER

→ 논문은 실험과 분석을 통해서 아래의 물음에 대한 답을 하고자 한다.

Q1. Pipeline과 Self-rationalizing 중 어느 것이 성능이 좋은가?

Q2. Predicted Label과 Generated Rationale간의 관계를 평가하는 방법이 없을까?

Dataset, Task, Evaluation

→ Dataset & Task

•

Free Text Rationale이 있는 Commonsense QA & SNLI

→ Model

•

T5 Model을 다양한 Setting에 맞게 훈련해서 활용

◦

I→R (inputs → rationales) 

◦

R→O (rationales → outputs)

◦

I→OR (inputs → outputs / rationales)

◦

IR→O (inputs / rationales → outputs)

◦

I→O (inputs → outputs)

•

비교하는 Model

◦

Pipeline : I→R;R→O 

◦

Self-rationalizing : I→OR

→ Evaluation

•

Rationale Evaluation

◦

token overlap으로 gold rationale과 predicted rationale을 평가하는건 plausibility, faithfulness를 측정하지 못함 (Rationale의 본질의 집중하자)

◦

Simulatability measures기법 차용: Rationale이 들어감으로써 Output에 대한 predictive power가 얼만큼 증가했는지 평가 (ie., IR→Ô minus I→Ô)

◦

Ô을 활용할 경우 degeneration이 생겨서 gold label O를 활용했다고 해서 이를 방지

→ 직관적 이해: 노이즈가 섞인 Input으로 만들어진 Rationale로 추정한 Output Acc는 훼손될것

→ Result

•

Self-rationalizing이 pipeline보다 성능이 잘 나온다.

•

하지만 Self-rationalizing이 I → O 으로 직접적으로 training한 모델보다 미세하게 성능이 떨어진다. 

(논문은 성능이 미세하게 떨어져도 모델이 추론능력을 학습했는지를 보여주고자 함)

Shortcomings of Free-Text Pipelines

→ Joint Model Rationales are More Indicative of Labels

•

I → R로 학습한 Model은 task-specific input이 주어지지 않을 경우 output을 맞추기 위한 rationale을 학습하기 어렵다

아래와 같은 Ontonotes Dataset에 Task Specific 없이

Input “The quick brown fox jumps over the laze dogs” → Rationale

을 infer할 경우 Model은 POS나 Syntactic Parsing중 어떤 rationale 뱉을지 모르기 때문에 같은 rationale을 만들도록 학습이 됨

•

Output과 Rationale을 같이 생성하는 모델이 output을 예측하는데 유용한 정보를 줄 수 있음을 실험적으로도 보임

R*→ O Train한 후 (I→R, I→OR에서 생성한 Rationale를 가져와 Acc 측정)

◦

test set ground-truth R∗ rationales

◦

test set rationales generated by I→OR

◦

test set rationales generated by I→R

•

본인들이 제안한 metric을 바탕으로 Output과 함께 만든 Rationale이 더 유용함을 보임 (higher is better)

→ Sufficiency is not Universally Valid

•

Ouput을 알지 못한 Rationale추론(sufficiency 가정)은 universally하게 valid하지 않다.

◦

IE Task처럼 일부 Token만이 정보를 추출하는데 유의미할 경우 sufficiency 가정이 유효할 수 있으나 reasoning은 아님

◦

애초에 Rationale은 Input → Ouput까지 가는데 있어서 ‘Complement’ 역할인데, R → O 모델로만 Output을 예측하는 것은 당연히 성능저하로 이어질 수 밖에 없음

◦

E-SNLI 성능 차이가 미미한 이유는 Rationale에 output에 대한 rationale이 명시적으로 완벽하게 포함되어 있기 때문. 

→ Pipeline을 쓰면 안되는 이유

•

cascading errors caused by low-quality rationales that are not indicative of labels

•

missing information due to rationales not being sufficient

•

double the number of parameters and more manual labor needed to reach comparable performance to an end-to-end (I→O) model; still often performing worse 

Analyzing Necessary Properties of Joint Models (Ablation Study)

(I → OR 모델이 얼마나 faithful한 rationale을 가지고 prediction을 하는가

predicted label과 predicted Rationales 사이에는 어떠한 관계가 있는가)

→ Robustness Equivalence

•

input에 noise가 가해졌을 때 output / rationale 중 하나가 틀리면 둘 사이의 관계가 없다고 말할 수 있음

(둘다 틀리거나 아니면 같은 pattern을 가지고 틀려야 faithful한 rationale을 가지고 output을 생성했다고 말할 수 있음)

•

N (0, σ^2)에 따라 σ^2를 늘려가면서 input embedding에 노이즈를 주어 output(accuracy)과 rationale(output predictive power)이 얼마나 손상되었는지 평가 

•

Noise가 적은 구간(0~15)에는 output과 rationale quality가 stable하게 유지되면서 하락하고 큰 구간(15~)에서는 unstable하게 떨어진다.

⇒ Noise에 따라서 같은 pattern을 보이므로 model이 생성한 output과 rationale은 관계가 있다고 볼 수 있다.

(검은색은 noise를 가해도 prediction이 변하지 않은 output의 비율)

[CoS-E dataset 기준]

→ Feature Importance Agreement

•

output과 rationale이 관계가 있다면 output 예측에 있어서 중요한 input tokens은 rationale 생성에 있어서도 중요함. 반대도 마찬가지 (important input token → output / rationale) [그 정도를 측정하는 과정]

•

중요한 token을 식별했으면, Remove and Retrain (ROAR) occlusion method을 통해 → 중요한 token에 noise가 가해졌으면 output/rationale이 비슷한 패턴으로 망가질 것

•

중요한 token을 어떻게 식별할 것인가?

◦

Gradient를 활용해서 input token → output / rationale 중요도 측정

: i번째 input token과 p번째 class logit과의 관계 (f=L1 사용)

: derivative → taylor series → decoding → output / rationale

•

Remove and Retrain (ROAR) occlusion method

◦

Random Token에 노이즈를 주었을때보다 Input과 관련된 중요한 Token에 Noise를 주고 재학습했을 때 성능 저하가 더 뚜렷함 (Acc)

◦

마찬가지로 Random Token에 노이즈를 주었을때보다 Rationale과 관련된 중요한 Token에 Noise를 주고 재학습했을 때 성능 저하가 더 뚜렷함 (Acc)

⇒ LM안에서 input token → output / rationale 중요도를 추론할 능력은 있음

◦

Random Token에 노이즈를 주었을때보다 Input과 관련된 중요한 Token에 Noise & Rationale과 관련된 중요한 Token에 Noise를 주고 재학습했을 때 Rationale Quality 저하 (SNLI dataset에서도 hold)

⇒ (논문에서 이에 대한 해석은 없지만) 당연히 noise가 크면 acc나 rationale quality가 dissimimilair하게 움직이지만 적은 noise하에서는 acc(output)나 rationale quality가 비슷하게 망가짐

⇒ output / rationale의 관계를 보임

[CoS-E dataset 기준]

→ 직관적 이해: 노이즈가 섞인 Input으로 만들어진 Rationale로 추정한 Output Acc는 훼손될것

[Figure 오른쪽 Rationale Quality 계산]

Conclusion

→ T5 base model은 faithful한 free-text rationales을 생성하는 능력이 있다.