SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures

1. Introduction

•

Inspired by cognitive theories of how humans reason, there is active research on prompting methods that aim to improve the reasoning capabilities of LLMs.

◦

Chain-of-Thought (ZS/FS): how humans solve problems step-by-step.

e.g.) Let’s think step by step.

◦

Decomposition-based prompting: how humans break down a complex problem into a series of smaller subproblems, and then solve those subproblems one by one.

e.g.) How can I break down this problem into smaller, more manageable parts?

◦

Step-back prompting: how humans reflect on task nature to derive general principles.

e.g.) Use Reflective Thinking: Step back from the problem, take the time for introspection and self-reflection. Examine personal biases, assumptions, and mental models that may influence problem-solving, and being open to learning from past experiences to improve future approaches.

•

Limitations: each prompting method implicitly assumes, in advance, a particular reasoning process for handling the given task.

→ Motivation: to solve a task well, don't we need a prompting method that can identify the reasoning structure unique to that task?

2. Self-Discovering Reasoning Structures for Problem-Solving

•

When humans face a new problem, they search through the knowledge and skills they already have for the ones that are useful for solving it.

•

They then adapt the recalled knowledge and skills to the problem, and connect additional knowledge as needed to solve it.

→ SELF-DISCOVER carries this pipeline over to LLMs as is.

Stage 01

•

Use meta-reasoning to derive the intrinsic structure needed to solve the task.

#### 1. Select

•

Meta-prompt: P_{s}

•

reasoning module descriptions (39 in total): D

•

unlabeled test examples: t_{i}

→ Select a subset of reasoning modules D_{s} that are useful for solving the task:

D_{s} = M(P_{s} || D || t_{i})
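
The Select step can be sketched as plain prompt concatenation followed by one model call. This is a minimal sketch, not the paper's implementation: the `Model` type, the function names, the section labels inside the prompt, and the wording of `META_PROMPT` are all assumptions made for illustration, and `fake_model` is a toy stand-in for an LLM.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: maps a prompt string to a completion string.
Model = Callable[[str], str]


def build_select_prompt(meta_prompt: str, modules: List[str], task_examples: List[str]) -> str:
    """Concatenate meta-prompt || module descriptions D || task examples t_i."""
    module_block = "\n".join(f"{i}. {m}" for i, m in enumerate(modules, start=1))
    example_block = "\n\n".join(task_examples)
    return (
        f"{meta_prompt}\n\n"
        f"Reasoning modules:\n{module_block}\n\n"
        f"Task examples:\n{example_block}"
    )


def select(model: Model, meta_prompt: str, modules: List[str], task_examples: List[str]) -> str:
    """Return D_s: the model's free-text selection of useful modules."""
    return model(build_select_prompt(meta_prompt, modules, task_examples))


# Two of the module descriptions quoted in the introduction; the full list has 39.
MODULES = [
    "Let's think step by step.",
    "How can I break down this problem into smaller, more manageable parts?",
]

# Illustrative wording only, not the paper's actual meta-prompt.
META_PROMPT = "Select the reasoning modules that are useful for solving the given task."


def fake_model(prompt: str) -> str:
    # Toy model for demonstration: always picks the decomposition module.
    return MODULES[1]


selected = select(fake_model, META_PROMPT, MODULES, ["Q: 2 + 3 * 4 = ?"])
```

Because the examples are unlabeled, this call needs no ground-truth answers; it only has to be run once per task.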

#### 2. Adapt

→ tailoring each selected module to the task at hand.

•

e.g.) “break the problem into subproblems” → “calculate each arithmetic operation in order”

D_{A} = M(P_{A} || D_{s} || t_{i})
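
A minimal sketch of the Adapt call, assuming the same concatenate-then-call pattern as Select: an adapt meta-prompt, the selected modules D_{s}, and the task examples are joined and sent to the model, which returns the task-specific rephrasing D_{A}. The names `adapt`, `adapt_meta_prompt`, and `toy_model` are hypothetical, and the toy model only hard-codes the arithmetic example above.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call.
Model = Callable[[str], str]


def adapt(model: Model, adapt_meta_prompt: str, selected_modules: str, task_examples: str) -> str:
    """Return D_A: the selected modules D_s rephrased for the task at hand."""
    prompt = "\n\n".join([adapt_meta_prompt, selected_modules, task_examples])
    return model(prompt)


def toy_model(prompt: str) -> str:
    # Toy rule standing in for an LLM: specialize the generic decomposition module.
    if "break the problem into subproblems" in prompt:
        return "calculate each arithmetic operation in order"
    return prompt


adapted = adapt(
    toy_model,
    "Rephrase the selected reasoning modules so they fit the task.",  # illustrative wording
    "break the problem into subproblems",
    "Q: 2 + 3 * 4 = ?",
)
```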

#### 3. Implement

→ operationalizes the adapted reasoning modules D_{A} into an implemented reasoning structure D_{I}, with specific instructions on what to generate at each step.

•

Meta-prompt P_{I} & demonstration S_{human} (a human-written reasoning structure from another task)
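
A hedged sketch of the Implement step: the model is asked to turn the adapted modules into a key-value plan, and the reply is parsed as JSON so that malformed output fails early. The concatenation order, the function name `implement`, and the example keys are assumptions for illustration; `toy_model` replaces a real LLM.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-in for an LLM call.
Model = Callable[[str], str]


def implement(
    model: Model,
    implement_meta_prompt: str,
    human_demo: str,
    adapted_modules: str,
    task_examples: str,
) -> Dict[str, str]:
    """Return D_I: a JSON object whose keys are reasoning steps with empty values."""
    prompt = "\n\n".join([implement_meta_prompt, human_demo, adapted_modules, task_examples])
    structure = json.loads(model(prompt))
    if not isinstance(structure, dict):
        raise ValueError("expected a JSON object of reasoning steps")
    return structure


def toy_model(prompt: str) -> str:
    # Toy output: a three-step plan with values left blank for Stage 02 to fill in.
    return json.dumps(
        {
            "Identify the operations": "",
            "Calculate each arithmetic operation in order": "",
            "Final answer": "",
        }
    )


structure = implement(
    toy_model,
    "Turn the adapted modules into a step-by-step JSON plan.",  # illustrative wording
    '{"Step 1": "", "Step 2": ""}',  # placeholder for the human-written demonstration
    "calculate each arithmetic operation in order",
    "Q: 2 + 3 * 4 = ?",
)
```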

Stage 02

•

The intrinsic structure template derived for the task in Stage 01 is used to solve every instance of the task.

◦

“Follow the step-by-step reasoning plan in JSON to correctly solve the task. Fill in the values following the keys by reasoning specifically about the task given. Do not simply rephrase the keys.” This instruction + the reasoning structure + the task instance are passed to the model.

◦

The model solves the task by filling in the values.

A = M(D_{s}||t)
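
Stage 02 can be sketched as one model call per instance, reusing the same discovered structure every time. The instruction string is the one quoted above; the function name `solve_task`, the section labels in the prompt, and the toy model are assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call.
Model = Callable[[str], str]

INSTRUCTION = (
    "Follow the step-by-step reasoning plan in JSON to correctly solve the task. "
    "Fill in the values following the keys by reasoning specifically about the task given. "
    "Do not simply rephrase the keys."
)


def solve_task(model: Model, structure: Dict[str, str], instances: List[str]) -> List[str]:
    """Solve every instance with the same reasoning structure: one call per instance."""
    plan = json.dumps(structure, indent=2)
    return [
        model(f"{INSTRUCTION}\n\nReasoning structure:\n{plan}\n\nTask:\n{t}")
        for t in instances
    ]


calls: List[str] = []


def toy_model(prompt: str) -> str:
    # Toy model: records each prompt it receives and returns a numbered placeholder.
    calls.append(prompt)
    return f"answer #{len(calls)}"


structure = {
    "Identify the operations": "",
    "Calculate each arithmetic operation in order": "",
    "Final answer": "",
}
answers = solve_task(toy_model, structure, ["2 + 3 * 4", "(1 + 5) / 2"])
```

The structure is serialized once and shared, so the per-instance cost stays at a single inference call.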

3. Experimental Setup

Tasks

•

BIG-Bench Hard (BBH) (23 tasks)

◦

1) Algorithmic and Multi-Step Arithmetic Reasoning, 

◦

2) Natural Language Understanding, 

◦

3) Use of World Knowledge, and 

◦

4) Multilingual Knowledge and Reasoning.

•

Thinking for Doing (T4D) 

◦

where models must leverage mental state reasoning to determine actions to perform

◦

GPT-4 with CoT reaches only about 50%

•

MATH (200 examples)

◦

reasoning structures are generated at the instance level

Models

•

GPT-4 (gpt-4-turbo-preview) (OpenAI, 2023b)

•

GPT-3.5-turbo (ChatGPT) (OpenAI, 2022)

•

instruction-tuned PaLM 2-L (Anil et al., 2023)

•

open-source LLM Llama2-70B

Baselines (zero-shot setting)

•

Direct Prompting: prompts the model to generate the answer directly, without reasoning steps

•

CoT: prompts the model to generate reasoning steps

◦

Let’s think step-by-step

•

Plan-and-Solve: prompts the LM to generate a plan and then solve the problem

◦

Let’s devise a plan and solve the problem

Baselines with RAW Reasoning Modules

•

CoT-Self-Consistency: samples multiple reasoning paths by varying the seed, generates an answer from each, and selects the most frequent answer as the final answer by majority vote

•

Majority voting of each RM: attaches each of the 39 reasoning module prompts in turn, generates an answer with each, and selects the most frequent answer as the final answer by majority vote

•

Best of each RM: the accuracy of the single reasoning module prompt that scores highest among the 39 (i.e., the upper bound for a single prompt)
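
The two aggregation baselines above can be sketched as follows. This is a minimal sketch with made-up toy predictions, not the paper's evaluation code: `per_module_preds[m][i]` is assumed to hold the answer produced with module prompt `m` on instance `i`, and all function names are hypothetical. The same `majority_vote` would apply to CoT-Self-Consistency, with sampled reasoning paths in place of module prompts.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def majority_vote_accuracy(per_module_preds: List[List[str]], gold: List[str]) -> float:
    """Vote across modules for each instance, then score the voted answers."""
    voted = [majority_vote(column) for column in zip(*per_module_preds)]
    return accuracy(voted, gold)


def best_single_module_accuracy(per_module_preds: List[List[str]], gold: List[str]) -> float:
    """Upper bound for one prompt: requires gold labels to pick the best module."""
    return max(accuracy(preds, gold) for preds in per_module_preds)


# Toy data: 3 module prompts x 3 instances (made up for illustration).
gold = ["A", "B", "C"]
per_module_preds = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "B", "D"],
]
mv_acc = majority_vote_accuracy(per_module_preds, gold)
best_acc = best_single_module_accuracy(per_module_preds, gold)
```

In this toy case the best single module beats the vote, which shows why "Best of each RM" is treated as an oracle-style upper bound rather than a usable method.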

4. Results

4.1. Does SELF-DISCOVER Improve LLM Reasoning?

→ CoT and PS prompt under an implicitly pre-assumed reasoning process for the task. The Self-Discover pipeline instead composes a structure suited to the task by selecting among multiple reasoning processes/modules, and unsurprisingly it performs better.

4.2. Which Types of Problems Do SELF-DISCOVER Help the Most?

→ In experiments on PaLM 2-L, the clearest gains over Direct/CoT appeared on reasoning tasks that require world knowledge, such as sports understanding, movie recommendation, and ruin names.

→ The authors attribute the large gains on world-knowledge tasks to SELF-DISCOVER integrating multiple prompts into a single reasoning structure.

4.3. How Efficient is SELF-DISCOVER?

•

To show the efficiency of SELF-DISCOVER, it is compared directly with methods that generate the reasoning process multiple times.

→ Compared on the BIG-Bench Hard tasks Movie Recommendation and Geometric Shapes [GPT-4]

→ y-axis: average accuracy within the task; x-axis: number of inference calls required per instance

→ CoT-Self-Consistency generates 10 independent reasoning paths and aggregates their answers; Majority voting each RM takes a majority vote over 40 reasoning paths for every instance.

→ Best of each RM*: the highest accuracy obtained when every reasoning path is applied, assuming access to the ground-truth labels

→ SELF-DISCOVER requires 1 inference call per instance plus 3 calls per task, making it an efficient method that still achieves high average accuracy.
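
The call counts above can be turned into a quick back-of-the-envelope comparison. The per-method multipliers (3 task-level calls plus 1 per instance, 10 paths, 40 paths) come from the notes above; the function names and the task size of 250 instances are assumptions chosen for illustration.

```python
def self_discover_calls(num_instances: int, task_level_calls: int = 3) -> int:
    """Select, Adapt, Implement once per task, then one call per instance."""
    return task_level_calls + num_instances


def self_consistency_calls(num_instances: int, num_paths: int = 10) -> int:
    """CoT-Self-Consistency: num_paths sampled reasoning paths per instance."""
    return num_paths * num_instances


def rm_majority_calls(num_instances: int, num_paths: int = 40) -> int:
    """Majority voting of each RM: num_paths reasoning paths per instance."""
    return num_paths * num_instances


# Hypothetical task size, not a figure taken from the paper.
n = 250
total_self_discover = self_discover_calls(n)
total_self_consistency = self_consistency_calls(n)
total_rm_majority = rm_majority_calls(n)
```

For this hypothetical task of 250 instances, that is 253 calls for SELF-DISCOVER versus 2,500 and 10,000 for the two voting baselines, and the fixed task-level cost matters less as the number of instances grows.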

5. Deep Diving Into Self-Discovered Reasoning Structures

5.1. Importance of SELF-DISCOVER Actions

→ Ablation studies on Stage 01 [GPT-4]

→ Applying only Select, or only Select & Adapt, on top of CoT to tailor the reasoning module prompts already improves performance.

→ Running the full Select → Adapt → Implement sequence to build task-specific structures gives the clearest improvement.

5.2. Towards Universality of Discovered Reasoning Structures

→ Prompts are optimized on PaLM 2-L (OPRO vs SELF-DISCOVER) and then applied to GPT-4 to evaluate performance.

→ OPRO is a method that optimizes prompts using training examples.

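→ The authors claim that SELF-DISCOVER transfers better on 3 of 4 tasks and is also more efficient, since OPRO used as much as 20% of the data for training. (This is not an exact like-for-like comparison, though, so I can't say so definitively.)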