ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data

1. Introduction

•

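There are two main explanations for why pre-trained LLMs perform well in the zero-shot setting.

1. They directly reuse what was memorized during pre-training.

2. They learned how to reason during pre-training and apply that ability.

→ Setting aside which of 1. or 2. is correct, we cannot tell which data is used or drawn on for inference in an arbitrary downstream task.

→ In other words, which pre-training data serves as evidence for the good performance has not yet been clearly identified.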

•

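This work

◦

starts from the assumption that, within the broad pre-training data, there is a subset that helps a specific downstream task (i.e., its performance) relatively more than the rest.

◦

Under that assumption, it proposes a method for finding this task-specific evidence subset as well as possible.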

•

According to the authors, this work differs from previous work because

◦

previous work either studied which elements inside the test input (x_test) directly help test performance, or

◦

mostly looked for the evidence subset that influences a single test example.

→ Recovering the whole task-specific evidence subset from the influence on one test example at a time is infeasible.

2. Problem Formulation

•

Unlike the curated data targeted by zero-shot / ICL tasks, pre-training data is a mix of domains and prone to noise, so the authors want to know which pre-training data contributed to performance on a specific downstream task.

•

First, define the LM's pre-training data as follows.

◦

AE (autoencoding) family: (masked input, reconstructed token)

◦

AR (autoregressive) family: (conditional input, next token to be predicted)

•

A language model θ^PT is trained to minimize a loss L over the pretraining examples: θ^PT = argmin_θ L(D^PT; θ).

•

D_task ∋ (x_task, y_task). The language model makes its decision on the task by applying a template and a verbalizer: → p_θ(verbalizer(y_task) | template(x_task)).

•

Our goal is to find a set S that is a subset of D^PT and

(1) is much smaller than D^PT, and

(2) contributes strongly to performance on D_task.

•

Once S has been found, there are two ways to check how useful it is.

1. Remove S from the pre-training data, pre-train again, and then run zero-shot on D_task.

2. Pre-train on S minimally to 'boost' the model parameters toward D_task.

•

The authors adopt option 2., splitting S into mini-batches for the updates.

•

The boosted model is evaluated with the following metric.

•

Implications

◦

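Can we say that S^(c), the complement of S, does not contribute to task performance? Of course not. It may contribute indirectly, for example through grammatical knowledge, but the goal of this work is to find the more direct task-specific evidence subset.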
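The 'boost' check in option 2. can be sketched on a toy model. This is a minimal illustration, not the authors' setup: a two-feature logistic model stands in for the LM, its output probability stands in for p_θ(verbalizer(y_task) | template(x_task)), and the names `boost`, `task_loss`, `batch_size`, and `lr` are hypothetical. The before/after loss comparison is also an assumption and is not the metric used in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prob(w, x):
    # Toy stand-in for p_theta(verbalizer(y) | template(x)).
    return 1.0 / (1.0 + math.exp(-dot(w, x)))

def task_loss(w, d_task):
    # Mean negative log-likelihood of the gold labels on D_task.
    total = 0.0
    for x, y in d_task:
        p = prob(w, x)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(d_task)

def boost(w, s, batch_size, lr):
    # One pass over S in mini-batches; each batch gives one gradient step.
    w = list(w)
    for start in range(0, len(s), batch_size):
        batch = s[start:start + batch_size]
        g = [0.0] * len(w)
        for x, y in batch:
            p = prob(w, x)
            for j, xj in enumerate(x):
                g[j] += (p - y) * xj / len(batch)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Made-up data: D_task and a candidate evidence subset S.
d_task = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
s = [([0.9, 0.1], 1), ([0.1, 0.9], 0), ([1.0, 0.2], 1)]

w0 = [0.0, 0.0]
w1 = boost(w0, s, batch_size=2, lr=0.5)
print(task_loss(w0, d_task), task_loss(w1, d_task))
```

If S really carries task-relevant signal, the loss on D_task should drop after only a few updates on S, which is the effect the boosting check looks for.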

3. ORCA

•

The intuition behind the proposed method is as follows.

Under the hypothesis that continual pre-training on D_task helps the task, it suffices to sample a portion of D^PT that changes the model parameters the way D_task would.

Continual pre-training on D_task (which helps the task) would itself proceed iteratively, so S is also selected iteratively.

•

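Suppose S is sampled over m iterations.

Since task-specific continual pre-training is assumed to help,

1. Compute the batch gradient on D_task.

2. Compute the gradient for a single example from the pre-training data.

3. If the cosine similarity between the gradients from 1. and 2. exceeds a threshold, add the example to the iteration subset (e.g., S1).

4. Update the model with the iteration subset (e.g., S1).

5. Repeat the steps above.

→ Running step 4 after every selected example is costly, so the model is updated again only once |Si| has grown to a certain size.

The variant that drops the floor function and updates immediately, one example at a time, is called NL (No-lagging).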
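The selection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a toy logistic model replaces the LM, and `orca_select`, `per_iter`, `threshold`, and `lr` are hypothetical names and values chosen for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def grad(w, x, y):
    # Per-example gradient of the logistic loss (toy stand-in for the LM loss).
    p = 1.0 / (1.0 + math.exp(-dot(w, x)))
    return [(p - y) * xi for xi in x]

def batch_grad(w, data):
    grads = [grad(w, x, y) for x, y in data]
    return [sum(col) / len(grads) for col in zip(*grads)]

def orca_select(w, d_task, d_pt, m, per_iter, threshold, lr):
    w = list(w)
    remaining = list(range(len(d_pt)))
    selected = []
    for _ in range(m):
        g_task = batch_grad(w, d_task)                 # step 1
        s_i = []
        for idx in remaining:
            x, y = d_pt[idx]
            g_pt = grad(w, x, y)                       # step 2
            if cosine(g_pt, g_task) > threshold:       # step 3
                s_i.append(idx)
            if len(s_i) == per_iter:                   # lagged update: wait for a full S_i
                break
        if not s_i:
            break                                      # nothing left that matches
        g_s = batch_grad(w, [d_pt[i] for i in s_i])    # step 4
        w = [wi - lr * gi for wi, gi in zip(w, g_s)]
        remaining = [i for i in remaining if i not in s_i]
        selected.extend(s_i)
    return selected, w

# Made-up data: indices 0 and 2 agree with the task, 1 and 3 do not.
d_task = [([1.0, 0.0], 1), ([0.9, 0.1], 1)]
d_pt = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.8, 0.2], 1), ([0.0, 1.0], 0)]
selected, w_new = orca_select([0.0, 0.0], d_task, d_pt,
                              m=3, per_iter=2, threshold=0.0, lr=0.5)
print(selected)
```

Setting `per_iter=1` in this sketch corresponds to updating after every selected example, which is the spirit of the no-lagging variant.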

4. Experimental Set-up

Baseline

•

Random Sampling: sample a subset S of size |S| at random

•

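Embedding nearest neighbors: search for nearest neighbors as follows

→ For x^(PT), fill the [MASK] token with its ground-truth value and take the last hidden state of that token

→ For x^(task), insert the ground-truth label and take the last hidden state of that token

→ Sample t examples from D_task and retrieve the k nearest neighbors of each → t*k candidates

(with r repetitions)

→ Select |S| examples from the t*k candidates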

•

ORCA with embeddings: run the ORCA search using embeddings instead of gradients.
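The embedding nearest-neighbors baseline above can be sketched as follows, assuming the hidden-state embeddings are already available as plain vectors. The distance (cosine similarity) and the final step (a uniform random draw of |S| items from the pooled candidates) are assumptions made for this illustration; the notes do not specify either. The function name `knn_pool` and the `seed` parameter are hypothetical.

```python
import math
import random

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na > 0 and nb > 0 else 0.0

def knn_pool(task_emb, pt_emb, t, k, r, size, seed=0):
    rng = random.Random(seed)
    pool = set()
    for _ in range(r):                        # r repetitions
        for q in rng.sample(task_emb, t):     # t task examples per repetition
            ranked = sorted(range(len(pt_emb)),
                            key=lambda i: cosine(q, pt_emb[i]),
                            reverse=True)
            pool.update(ranked[:k])           # k nearest pre-training examples
    candidates = sorted(pool)
    # Assumed final step: draw |S| candidates uniformly at random.
    return rng.sample(candidates, min(size, len(candidates)))

# Made-up embeddings.
task_emb = [[1.0, 0.0], [0.9, 0.1]]
pt_emb = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.05], [-1.0, 0.0]]
subset = knn_pool(task_emb, pt_emb, t=1, k=2, r=2, size=2)
print(sorted(subset))
```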

Setup

•

Backbone

◦

BERT

•

Dataset

◦

IMDB

◦

MNLI

•

Pre-training data

◦

sample 0.5% of the full pretraining data (BookCorpus & Wikipedia).

◦

X^PT_context: sequence of 512 tokens

◦

X^PT_mask: single masked token in the sequence

•

ORCA

◦

m=20

◦

100 examples per iteration

5. Evaluation

Zero-shot Performance

Prompt-Tuning Performance

→ In prompt tuning, the embedding layer has already been trained on the training examples, so useful signals in the pre-training data have less room to bring further improvement.

→ Nevertheless, performance improves on IMDB.

6. Analysis

Which source corpus does the supporting data evidence come from?

•

Although Wikipedia makes up the larger share of the pre-training data (76.5% vs. 23.5% for BookCorpus), ORCA tends to select the pre-training subset that helps the downstream task from BookCorpus. For IMDB, BookCorpus consists of novels that could involve strong emotions and sentiments. For MNLI, this could be due to the selection of the colloquial verbalizer words (e.g., “yes”, “maybe”), which can be scarce in Wikipedia.

What are the masked tokens in the supporting data evidence?

•

Some tokens are unrelated to the task's label space, but many highly relevant tokens were chosen as X^PT_mask.

Is the context of the supporting data evidence similar to the task input data?

→ If X^PT_context and the task input consist of the same surface words, the improvement may simply come from memorizing X^PT_context.

→ If not, knowledge can be said to have transferred from X^PT_context to the task.

→ The authors (1) sampled 2000 examples from the task, (2) truncated X^PT_context to a window on both sides of X^PT_mask, and (3) measured MAUVE (a similarity score) between (1) and (2).

•

The results are not fully consistent, but the similarity tends to be lower than for Random.
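Step (2) of the similarity analysis, truncating the context on both sides of the masked token, can be sketched as below. The window size is not given in these notes, so `half_width` is a hypothetical parameter. The MAUVE score itself would then be computed separately between the task samples and the truncated contexts.

```python
def window_around_mask(tokens, mask_index, half_width):
    # Keep up to `half_width` tokens on each side of the masked position.
    start = max(0, mask_index - half_width)
    end = min(len(tokens), mask_index + half_width + 1)
    return tokens[start:end]

# Made-up example context with the mask at position 3.
tokens = "the movie was [MASK] and fun".split()
print(window_around_mask(tokens, mask_index=3, half_width=1))
```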