Do Prompt-Based Models Really Understand the Meaning of Their Prompts?

Introduction

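1. No weapons of mass destruction have been found in Iraq yet.

2. Weapons of mass destruction have been found in Iraq.

Teaching a human to judge whether sentences 1 and 2 are equivalent from labeled examples alone requires a large and varied dataset.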

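However, if the question is rephrased as "Given the sentence 'No weapons of mass destruction have been found in Iraq yet', is the sentence 'Weapons of mass destruction have been found in Iraq' correct?", a human can learn the task quickly, almost at once. (Motivation of Instruct Fine-Tuning)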

→ When an additional prompt is combined with the input in this way, the model is believed to be able to read a meaningful task instruction from the input, which is said to make learning fast and stable.

→ Human-written prompts are reported to improve performance more than automatically searched or generated prompts (the paper cites a report that Schick and Schütze (2021b)'s manually written prompts still on average outperform the automatically searched prompts across a range of SuperGLUE tasks (Wang et al., 2019)), and prompts written by experts are said to be able to function as meaningful instructions.

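However, this study questions whether those gains, observed in few-shot and zero-shot settings across models from 230M to 175B parameters as well as instruction-tuned models, really arise because the models interpret the instruction in the prompt properly (as a human would).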

Prompt Tuning & Prompting

In what follows, "prompt tuning" and "prompting" refer to one of the following three:

•

Discrete Prompts: {sent} In summary, the restaurant is [prediction]

→ The model is tuned with the prompt above.

•

Priming: in-context learning (ICL)

•

Continuous Prompts (prompt tuning, p-tuning): In addition to discrete prompts, some methods use continuous prompts, i.e., prompt vectors that are learned by gradient descent rather than written as natural-language tokens. These continuous prompts are more flexible and can be tailored to specific tasks or domains. However, it is unclear whether they convey task-specific information better than discrete prompts.
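As a rough illustration of the first two modes, the sketch below fills the discrete template from the list above and then builds a priming (ICL) prompt by prepending labeled demonstrations. The function names, the example sentences, and the "good"/"bad" target words are hypothetical and are not taken from the paper.

```python
def discrete_prompt(sent: str) -> str:
    """Fill the discrete template; the model is expected to predict the next word."""
    return f"{sent} In summary, the restaurant is"


def priming_prompt(demos, query: str) -> str:
    """Prepend labeled demonstrations to the query (in-context learning)."""
    lines = [f"{discrete_prompt(sent)} {label}" for sent, label in demos]
    lines.append(discrete_prompt(query))
    return "\n".join(lines)


# Hypothetical demonstrations and target words, for illustration only.
demos = [("The pasta was cold.", "bad"), ("Lovely staff and food.", "good")]
prompt = priming_prompt(demos, "Best ramen in town.")
print(prompt)
```

With discrete-prompt tuning the model weights are updated on inputs shaped like `discrete_prompt(...)`, whereas with priming the weights stay fixed and only the input string grows.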

Experiment Setup

Problem Situation

•

Test how well models understand the meaning of the instruction in the prompt in few-shot settings

•

k-shot = {0, 4, 8, 16, 32, 64, 128, 256}

•

The scope of "instruction" within a prompt is narrowed to the "description of task"

Baseline Setup

•

Weak Baseline

◦

Small PLMs are prompt-based tuned or fine-tuned, then evaluated few-shot on the RTE validation set

▪

ALBERT > BERT, DistilBERT, T5, RoBERTa

▪

When the entire dataset is given as shots, prompt-based tuning and fine-tuning perform about the same

▪

ALBERT is selected as the prompt-based model for the 4-256 shot experiments

•

Instruction-Tuned Model

◦

T0 3B / 11B

◦

T5 LM-Adapted

•

ICL

◦

GPT-3 175B

Data

•

NLI (because T0 did not see it during instruction tuning)

◦

RTE

◦

WINOGRAD

◦

ANLI

•

The label space is unified to Yes/No for the experiments

4 random seeds
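The few-shot setup above (the listed shot counts, each repeated over 4 random seeds) can be sketched as follows. This is only an illustration under assumptions: the seed values, the independent resampling for each k, and the toy `train_set` are hypothetical, and the notes do not say how the paper draws its subsets.

```python
import random

K_SHOTS = [0, 4, 8, 16, 32, 64, 128, 256]
SEEDS = [0, 1, 2, 3]  # hypothetical seed values; the notes only say "4 random seeds"


def sample_shots(train_set, k, seed):
    """Draw k training examples reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(train_set, k)


# Toy stand-in for an NLI training set with binary labels.
train_set = [
    {"premise": f"p{i}", "hypothesis": f"h{i}", "label": i % 2} for i in range(1000)
]

# One few-shot training subset per (k, seed) combination.
runs = {(k, seed): sample_shots(train_set, k, seed) for k in K_SHOTS for seed in SEEDS}
```

Using a dedicated `random.Random(seed)` per draw keeps each (k, seed) subset reproducible regardless of the order in which runs are executed.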

Templates

Because the goal of the paper is to determine whether models actually understand the semantics of the instruction in the prompt, five categories of templates were written

•

Instructive: how we would describe the NLI task to a human who has never seen this task before.

→ If we expect prompt (instruction) tuning to make a model behave like a human who reads an instruction and then solves an unseen task, performance with instructive prompts should differ clearly from performance with the prompts below

•

Misleading-Moderate: instruct the models to perform a task related or tangential to NLI such that, if the model were to perform the task as explicitly instructed, it would perform poorly on NLI in general.

•

Misleading-Extreme: instruct the models to perform a task unrelated to NLI.

•

Irrelevant: concatenate the premise, a sentence unrelated to any NLP task, and the hypothesis.

•

Null: concatenate the premise and the hypothesis without any additional text.
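To make the five categories concrete, here is a minimal sketch with one template per category. The instructive template and the "start with the" question appear later in these notes; filing the latter under misleading-extreme is an assumption, and the misleading-moderate and irrelevant wordings are hypothetical examples rather than the paper's templates.

```python
TEMPLATES = {
    # Wording taken from the additional-experiment example in these notes.
    "instructive": '{premise} Based on the previous passage, is it true that "{hypothesis}"?',
    # Hypothetical wording: asks for a related task (paraphrase detection), not NLI.
    "misleading_moderate": '{premise} Can that be paraphrased as: "{hypothesis}"?',
    # Wording from these notes; assigning it to this category is an assumption.
    "misleading_extreme": '{premise} Does the paragraph start with "the"? {hypothesis}',
    # Hypothetical filler sentence unrelated to any NLP task.
    "irrelevant": "{premise} The bakery opens at nine on weekdays. {hypothesis}",
    # Premise and hypothesis only, no additional text.
    "null": "{premise} {hypothesis}",
}


def render(category: str, premise: str, hypothesis: str) -> str:
    """Fill the template of the given category with one NLI example."""
    return TEMPLATES[category].format(premise=premise, hypothesis=hypothesis)


premise = "No weapons of mass destruction have been found in Iraq yet."
hypothesis = "Weapons of mass destruction have been found in Iraq."
for category in TEMPLATES:
    print(f"{category}: {render(category, premise, hypothesis)}")
```

Only the text around the premise and hypothesis changes between categories, so any difference in learning speed can be attributed to the template itself.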

Results

•

T0: Instructive vs. Irrelevant Templates

→ T0 learns quickly regardless of whether the prompt is instructive or irrelevant

→ It does not appear to learn the semantics of the instruction

•

Misleading Template

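→ ALBERT: learns better with misleading-extreme templates than with misleading-moderate ones

→ T5-3B: also learns better with misleading-extreme templates than with misleading-moderate ones

→ T5 11B and GPT-3 show the sensible ordering (Instructive > Misleading-Moderate > Misleading-Extreme)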

•

Null Template

→ Generally the worst performer, although for templates with certain orderings it does well at 32 shots (such things can happen)

•

Zero shot

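→ T0 was the only model that performed even marginally above random at zero-shot, so the zero-shot experiments use T0

→ The 3B model performs at a similar level regardless of the prompt category

→ Statistically significant performance differences do not appear below 11B; they begin to appear with the 11B model (instructive prompts perform best, yet the model still responds far too well to misleading-extreme prompts)

→ GPT-3 shows a similar pattern (it is not instruction-tuned, but simply increasing model size does not solve the problem)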

Label Space

→ The label space is also changed arbitrarily to test whether models are sensitive to the target words

•

Yes-no: Model is expected to predict the word “yes” for entailment and “no” for nonentailment. (the default setting)

•

Yes-no-like: Semantically equivalent to yes-no but using superficially different words, e.g., “true”/“false”, “positive”/“negative”. (synonyms)

•

Arbitrary: Model is expected to predict arbitrary words that have no semantic relation to the entailment task, e.g., “cat” for entailment, “dog” for non-entailment.

•

Reversed: Model is expected to predict the opposite of the (intuitive) yes-no and yes-no-like labels, e.g., “no” for entailment, “yes” for non-entailment.
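The four label spaces can be viewed as different mappings from gold labels to target words. The sketch below is a hypothetical illustration: the dictionary keys and helper functions are assumptions, and the reversed mapping is simply the yes-no mapping swapped, as the definition implies.

```python
LABEL_SPACES = {
    "yes_no": {"entailment": "yes", "non-entailment": "no"},
    "yes_no_like": {"entailment": "true", "non-entailment": "false"},
    "arbitrary": {"entailment": "cat", "non-entailment": "dog"},
    "reversed": {"entailment": "no", "non-entailment": "yes"},
}


def decode(label_space: str, predicted_word: str):
    """Map a predicted target word back to its gold label, or None if unknown."""
    inverse = {word: label for label, word in LABEL_SPACES[label_space].items()}
    return inverse.get(predicted_word.strip().lower())


def accuracy(label_space: str, predicted_words, gold_labels) -> float:
    """Fraction of predictions whose decoded label matches the gold label."""
    correct = sum(
        decode(label_space, word) == gold
        for word, gold in zip(predicted_words, gold_labels)
    )
    return correct / len(gold_labels)
```

For example, under the reversed space a model that outputs "yes" is scored as predicting non-entailment, so a model that leans on the everyday meaning of "yes" is penalized.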

Results

→ For both ALBERT and T0, when tested with the best instructive template: Yes-No > Arbitrary, Reversed

•

Additional experiments were also run

An irrelevant or misleading template + yes-no targets, e.g., {premise} Does the paragraph start with "the"? [yes/no] {hypothesis}

An instructive template + arbitrary targets, e.g., {premise} Based on the previous passage, is it true that "{hypothesis}"? [cat/dog]

→ The irrelevant or misleading template + yes-no targets performs better. A human would quickly work out the mapping cat → entailment / dog → non-entailment from a few shots, but the models do not. Rather, the result suggests that the models are not interpreting the wrong instruction at all (even though T0 is a model additionally trained to interpret instructions).

Conclusion

→ Performance should differ across instructive, irrelevant, and misleading templates, but it does not (models do not interpret instructions the way humans do)

→ In contrast, the performance differences caused by the target words are consistent

Additional Interpretation

•

Lack of Competence (the task is too hard)

◦

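Non-instruction-tuned models perform at random at zero-shot, and it is unclear whether that is because they never learned to interpret instructions or because they lack the capacity to interpret them in the first place; this is why the experiments were run few-shot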

◦

Models may ignore the instruction altogether and reason about the entailment relation from spurious features

•

Lack of Compliance (ignoring the instruction)

◦

instructive and irrelevant templates make models learn significantly faster than misleading and null templates do (inserting an odd sentence in the middle leads to faster learning than inserting a misleading instruction)

◦

For the model, relying on spurious or heuristic features for predictions (e.g., the presence of an inserted irrelevant sentence) takes less effort than using complex syntactic or semantic features to interpret the instruction. [It seems that abrupt shifts in meaning are easier for the model to pick up on]