What In-Context Learning “Learns” In-Context: Disentangling Task Recognition and Task Learning

1. Introduction

•

No Consensus on how ICL works

◦

Some

▪

Xie et al. (2022): LLMs implicitly learn tasks required for downstream applications during pre-training, and the in-context demonstrations merely provide information

▪

Min et al. (2022): ICL performance is insensitive to the usage of ground-truth labels.

◦

Others

▪

Akyürek et al. (2023); von Oswald et al. (2022): construct theories that Transformer-based models may perform implicit gradient descent to update an “inner-model”

•

To examine how ICL works in finer detail as a function of model scale and number of demonstrations, this study splits ICL into the following two processes.

◦

Task Recognition (TR)

▪

Recognizing the task from the demonstrations (even without ground-truth labels) and applying prior knowledge to x_test

◦

Task Learning (TL)

▪

Learning a new task through input-label mappings that did not appear in pre-training

→ In general, when gold labels are given in the demonstrations, TR and TL operate at the same time. Through controlled experiments (in which only TR or only TL operates), the authors aim to characterize ICL behaviors.

2. Task Recognition and Task Learning

\text{The LLM is parameterized by } \theta \\ D_{demo} = (x_{1}, y_{1}, x_{2}, y_{2}, \ldots, x_{k}, y_{k}) \\ y_{test} \sim p_{\theta}(y \mid D_{demo}, x_{test})

\text{where the demonstrations elicit a mapping } f : x \to y
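As a rough illustration of how D_demo and x_test might be serialized into a single prompt, here is a minimal sketch. The function name `build_prompt` and the "Input/Label" template are hypothetical choices made for this example, not the templates used in the paper.

```python
from typing import List, Tuple


def build_prompt(
    demos: List[Tuple[str, str]],
    x_test: str,
    template: str = "Input: {x}\nLabel: {y}",  # hypothetical template
) -> str:
    """Concatenate k (x, y) demonstrations, then the unlabeled test input."""
    blocks = [template.format(x=x, y=y) for x, y in demos]
    # The test input reuses the template with an empty label slot,
    # so the model is asked to continue with y_test.
    blocks.append(template.format(x=x_test, y="").rstrip())
    return "\n\n".join(blocks)


demos = [("a great movie", "positive"), ("dull and slow", "negative")]
prompt = build_prompt(demos, "surprisingly fun")
print(prompt)
```

The model's continuation of this prompt plays the role of y_test sampled from p_θ(y | D_demo, x_test).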

2.1 Task Recognition

•

The ability to recognize the mapping f by observing only the distribution of the k inputs x and the distribution of the labels y, without the (x, y) pairings

•

If the model has recognized f correctly, it should be able to apply prior knowledge to x_test and predict y.

•

If only Task Recognition is at work, the following formula applies.

→ Even when wrong mappings are given as demonstrations, the model can still answer correctly to some extent.
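As a sketch of what the TR-only condition could look like (an assumed form written in the notation of Section 2, not a quotation of the paper's formula): the prediction depends on the demonstrations only through the collection of inputs and the collection of labels, not through how they are paired.

```latex
p_{\theta}\left(y \mid x_{test}, \{(x_i, y_i)\}_{i=1}^{k}\right)
  = p_{\theta}\left(y \mid x_{test}, \{x_i\}_{i=1}^{k}, \{y_i\}_{i=1}^{k}\right)
```

Under this reading, shuffling which label is attached to which input leaves the right-hand side unchanged, which is why wrong mappings need not destroy accuracy.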

2.2 Task Learning

•

How well does the LM learn a new input-label mapping that appears in the demonstrations? (Being given a wrong mapping is fatal here.)

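The authors hypothesize that, because TL is harder than TR (retrieving pre-training priors), the two mechanisms arise under separate conditions.

1. Even relatively small LLMs do TR well, but this ability does not get stronger as model scale and the number of demonstrations grow.
2. TL ability grows as model scale and the number of demonstrations grow.

To show this, the authors manipulate the label space (as in other studies).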

•

GOLD: the standard ICL setting where we use natural prompts and gold input-label pairs. This setup reflects both TR and TL abilities.

•

RANDOM (=TR only): similar to Min et al. (2022), we use the same natural prompts as GOLD and sample demonstration labels uniformly at random from the label space. This setup reflects TR only.

•

ABSTRACT (=TL only): We use minimal prompts (which provide no task information) and characters with no clear semantic meanings (e.g. numbers, letters, and random symbols) as the label for each class. 

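This setting is designed so that input-label mappings learned during pre-training do not influence ICL.

Some symbols assumed to have no clear semantic meaning turned out to carry a bias (e.g. 0 → negative). The abstract labels were therefore not assigned at random per input; instead they were assigned at random in a way that stays consistent with the (x, y) pairs. The prompt was also minimized so that no task information leaks.

‘Prompt used in the ABSTRACT setting’

ABSTRACT performance was evaluated by how accurately the model predicted the randomly mapped labels (accuracy).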
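The three label settings and the ABSTRACT evaluation can be sketched as below. This is a minimal illustration under stated assumptions: every function name here is hypothetical, the toy data is invented for the example, and drawing one random one-to-one class-to-symbol mapping per prompt is this sketch's reading of the description above.

```python
import random
from typing import Dict, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, label)


def make_random(demos: Sequence[Demo], label_space: Sequence[str],
                rng: random.Random) -> List[Demo]:
    """RANDOM: keep the inputs, draw each label uniformly from the label space."""
    return [(x, rng.choice(list(label_space))) for x, _ in demos]


def sample_mapping(label_space: Sequence[str], abstract_labels: Sequence[str],
                   rng: random.Random) -> Dict[str, str]:
    """Draw a random one-to-one mapping from real classes to abstract symbols."""
    chosen = rng.sample(list(abstract_labels), k=len(label_space))
    return dict(zip(label_space, chosen))


def make_abstract(demos: Sequence[Demo], mapping: Dict[str, str]) -> List[Demo]:
    """ABSTRACT: replace each gold label by its abstract symbol."""
    return [(x, mapping[y]) for x, y in demos]


def abstract_accuracy(preds: Sequence[str], gold: Sequence[str],
                      mapping: Dict[str, str]) -> float:
    """Accuracy against the mapped labels, not the original class names."""
    targets = [mapping[y] for y in gold]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


rng = random.Random(0)
label_space = ["negative", "positive"]
gold_demos = [("a great movie", "positive"), ("dull and slow", "negative")]

random_demos = make_random(gold_demos, label_space, rng)   # TR only
mapping = sample_mapping(label_space, ["0", "1"], rng)
abstract_demos = make_abstract(gold_demos, mapping)        # TL only
```

Because the mapping is redrawn at random, a symbol such as "0" is not tied to any one class across prompts, which is meant to reduce the kind of bias noted above.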

Illustrative Example

3. Experimental Setup

Datasets

Sentiment analysis

•

SST-2 (Socher et al., 2013)

•

financial_phrasebank (Malo et al., 2014)

•

emotion (Saravia et al., 2018)

•

poem_sentiment (Sheng and Uthus, 2020)

Topic/stance classification

•

TREC (Voorhees and Tice, 2000)

•

tweet_eval_atheist,

•

tweet_eval_feminist (Mohammad et al., 2018; Basile et al., 2019)

Toxicity detection

•

tweet_eval_hate

•

ethos_race

•

ethos_gender

•

ethos_national_origin

•

ethos_religion (Mollas et al., 2020)

Natural language inference/paraphrase detection

•

SICK (Marelli et al., 2014)

•

SNLI (Bowman et al., 2015)

•

WNLI (Levesque et al., 2012)

•

MRPC (Dolan and Brockett, 2005).

Models

•

GPT-3 (ada (350M), babbage (1.3B), curie (6.7B), and davinci (175B))

•

OPT (2.7B, 6.7B, 13B, 30B, and 66B)

•

LLAMA1 (7B, 13B, 33B, and 65B)

Task Setup

→ 3 templates for each task

→ 150 test sets for GPT3

→ 1350 test sets for other models

4. Results

Task recognition is a broader capability across scales.

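→ In the RANDOM setting, performance does not increase much as model size or the number of demonstrations grows (saturation).

(Compare the blue lines for the small models with the blue lines for the large models.)

→ Even a small model (350M) can make predictions from semantic prior knowledge with just 8 demonstrations, and its performance is not far from that of the 175B model.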

Task learning is enabled with scale.

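→ The TL ability of small models does not improve even as the number of demonstrations increases. (orange lines, left column)

→ In contrast, large models improve as the number of demonstrations increases. (orange lines, right column)

(For OPT 66B, with only 16 demonstrations, there were cases where performance under the new label mapping exceeded the ICL setting with gold labels, i.e. TL > TL & TR.)

→ The TL performance of LLAMA 65B with 32 demonstrations is lower than that of the other LLMs larger than 65B with 32 demonstrations, which the notes read as reinforcing the claim ‘Task learning is enabled with scale’.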

Further Analysis

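→ In the ABSTRACT setting, numbers, letters, and symbols all show the same trend, but the authors argue that numbers and letters perform better because they appear more often in pre-training and are more ‘language-like’.

→ The ABSTRACT curve saturates more for NLI than for sentiment analysis. The authors speculate that, even with the ‘minimal prompt’, pre-training priors still apply to some degree through the sentence arrangement, prefixes, and similar features of the prompt.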