Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning (Prompting for Reasoning)

1. Introduction

•

Reasoning ability is one of the decisive factors in LLM performance

•

Of the techniques for maximizing LLM reasoning abilities, the paper introduces two representative ones

Chain of Thought

→ Human annotators write out a step-by-step reasoning process that leads to the final answer, and that reasoning process is inserted into each few-shot example

→ The expectation is that the LLM applies the reasoning process from the few-shot examples to the test sample: experiments with 8 manually written CoT prompts did show a performance gain.

Self-Consistency

→ After writing the CoT prompt, ask the LM to generate a diverse set of possible solutions

→ If several of them lead to the same answer, it is more likely to be the correct answer

→ Run several forward passes without fixing the random seed to generate a diverse set of reasoning paths ⇒ majority vote

→ Suggests that consistency could also serve as a quantitative measure for uncertainty estimation
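The sampling-and-voting step of Self-Consistency can be sketched as follows. This is a minimal illustration, not the original implementation: `sample_answer` is a hypothetical stand-in for a stochastic LLM call, `fake_llm` returns canned answers instead of calling a model, and the default of 5 samples is an assumption.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Sample several answers for the same CoT prompt and return the most frequent one."""
    answers: List[str] = [sample_answer(prompt) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; equal counts keep first-seen order
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a stochastic LLM: yields canned final answers in turn.
_canned = iter(["True", "False", "True", "True", "Unknown"])


def fake_llm(prompt: str) -> str:
    return next(_canned)


voted = self_consistency(fake_llm, "Q: <question> Let's think step by step.")
print(voted)
```

The fraction of samples that agree with the voted answer could be reported alongside it as a rough confidence signal, in line with the uncertainty-estimation remark above.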

#### Both methods are Direct Reasoning frameworks: they use the given facts and rules as-is to build a logical chain to the final result

•

Unsurprisingly, however, a Direct Reasoning framework does not work well on tasks that require indirect reasoning (e.g., tasks that rely on negation)

•

Motivation: for tasks that are hard to solve with a Direct Reasoning framework, could there be a framework/prompting method that performs indirect reasoning the way a person does, while remaining as logically rigorous as Direct Reasoning?

→ Two basic techniques are used to build this indirect framework/prompting method

◦

Contrapositive equivalence (a statement is logically equivalent to its contrapositive)

◦

Proof by Contradiction

2. Related Works

•

LLM reasoning abilities remain weak on tasks that are complex or require multi-step reasoning, such as indirect reasoning tasks. The lines of research aimed at improving this are as follows.

Fine-Tuning based methods

•

Relabel (X_train, y_train) as (X_train, y_rationale, y_train), then run supervised training so that the LLM generates y_rationale and y_train together. (y_rationale: human- or LLM-generated)

→ Limitation: catastrophic forgetting, which can weaken generalization to downstream tasks

Tool-based methods

•

Methods that improve LLM reasoning capabilities by drawing on non-parametric external knowledge or domain-specific knowledge

→ Representative method: RAG

CoT-based methods ***

•

Methods that design the prompt so that the LLM makes good use of its reasoning abilities

◦

Zero-shot CoT

▪

Add a suitable instruction to the prompt so that the LLM uses its reasoning abilities

◦

Few-shot CoT

▪

Include few-shot examples that contain reasoning processes so that the LLM also exercises its reasoning abilities on the test case [open questions include how to select the examples and how to split the reasoning process into smaller steps]

→ As noted earlier, most prior work designs prompts that build the reasoning chain using the given facts as-is.

→ Performance is poor on indirect reasoning tasks that require reasoning over negated facts → the paper proposes a prompting method to address this.

3. Preliminary

When a statement is hard to prove directly, indirect proof is used; of the indirect proof techniques, the paper adopts ‘Contrapositive’ and ‘Contradiction’.

Contrapositive

Given “If p, then q”, we also know that “if ¬q, then ¬p”.

Contradiction

•

Proof by contradiction (reductio ad absurdum): uses the original statement and its negation. To prove that the original statement is true, derive a contradiction from its negation.

¬(p → q) ⇔ ¬[(¬p) ∨ q] ⇔ p ∧ (¬q).
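Both identities used in this section can be checked by brute force over the four truth assignments. The sketch below is only a sanity check; the helper name `implies` is introduced here for illustration.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q is defined as (not p) or q."""
    return (not p) or q


ASSIGNMENTS = list(product([True, False], repeat=2))

# Contrapositive equivalence: (p -> q) <=> (not q -> not p)
contrapositive_holds = all(
    implies(p, q) == implies(not q, not p) for p, q in ASSIGNMENTS
)

# Negated implication: not (p -> q) <=> p and (not q)
negation_holds = all(
    (not implies(p, q)) == (p and not q) for p, q in ASSIGNMENTS
)

print(contrapositive_holds, negation_holds)
```

The second identity is what proof by contradiction relies on: assuming the negation of “if p then q” means assuming p together with ¬q, and showing that this pair is impossible.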

Problem Setting

#### Factual Reasoning

: Given facts and rules in natural language, the LLM builds a reasoning path P for the given question and picks one answer from the answer set {True, False, Unknown}.

#### Mathematical Proving

: Only facts and a question are given, with no explicit rules. The LLM is expected to draw the rules it needs from its pre-trained knowledge to solve the problem.

4. Methodology

→ The pipeline in one sentence: (1) augment the rules in the rule set using contrapositive equivalence, then (2) build a prompt that uses the augmented rule set, the question, and the facts together and applies the principle of contradiction to perform indirect reasoning.

Rule Augmentation

•

Fact: Bob does not drive to work

•

Rule: If the weather is fine, Bob drives to work

→ An LLM has difficulty reaching the conclusion “The weather is not fine” directly, whereas a person can get there through indirect reasoning.

•

Using the contrapositive equivalence in ‘Rule’ : If Bob does not drive to work, the weather is not fine.

→ Given this additional rule, it becomes easier to reach the conclusion “The weather is not fine”.

•

Few-shot prompting is used to obtain the contrapositives of the rules in the rule set
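A rough sketch of how rule augmentation by few-shot prompting might be wired up. The prompt wording, the function names, and the `llm` callable are all assumptions for illustration; only the Bob demonstration pair comes from the example above.

```python
from typing import Callable, List

# Demonstration pair taken from the Bob example; the prompt wording is assumed.
DEMOS = [
    ("If the weather is fine, Bob drives to work.",
     "If Bob does not drive to work, the weather is not fine."),
]


def build_contrapositive_prompt(rule: str) -> str:
    """Few-shot prompt asking for the contrapositive of a single rule."""
    lines = ["Write the contrapositive of each rule."]
    for original, contrapositive in DEMOS:
        lines.append(f"Rule: {original}")
        lines.append(f"Contrapositive: {contrapositive}")
    lines.append(f"Rule: {rule}")
    lines.append("Contrapositive:")
    return "\n".join(lines)


def augment_rules(rules: List[str], llm: Callable[[str], str]) -> List[str]:
    """Return the original rules followed by one generated contrapositive per rule."""
    return rules + [llm(build_contrapositive_prompt(r)).strip() for r in rules]


# Hypothetical stand-in for a model call, returning a fixed completion.
def fake_llm(prompt: str) -> str:
    return " If Alice is not happy, it is not sunny. "


augmented = augment_rules(["If it is sunny, Alice is happy."], fake_llm)
print(augmented)
```

Keeping the original rules and appending the contrapositives means the augmented rule set can be passed to either a direct or an indirect reasoning prompt without losing information.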

Indirect-Reasoning (Prompting)

•

Both a zero-shot prompt and a few-shot prompt are devised

#### Zero-shot Prompt

Augmentation appears to be applied to [Q]; alternatively, augmentation may be applied whenever rules are present

: The zero-shot prompt is applied to the Mathematical Proving task, where solutions differ substantially from problem to problem

→ The prompt is designed so that the model outputs False whenever the negated question conflicts with any of the given conditions (contradiction)

#### Few-shot Prompt

: The prompt opens by stating that the problem will be solved by proof by contradiction. After the Facts, Augmented Rules, and Question are listed, the Answer begins with ‘The negation of the original question is ¬Question’. The facts and rules are then used to decide whether the negated question leads to a contradiction. If the negated question is false (a contradiction is found), the original question is true; if no contradiction can be shown for the negated question, the original question is judged false.

→ The augmented rules are placed in [Rules]
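The layout described above can be sketched as a small prompt builder. The exact wording of the paper's prompt is not reproduced in these notes, so every string below is a hypothetical placeholder; in a real few-shot prompt, worked demonstrations in the same layout would precede the query.

```python
from typing import List


def build_ir_prompt(facts: List[str], rules: List[str], question: str) -> str:
    """Assemble the query part of a proof-by-contradiction prompt (wording assumed)."""
    parts = [
        "Solve the problem using proof by contradiction.",
        "Facts: " + " ".join(facts),
        "Rules: " + " ".join(rules),  # expected to already contain the augmented rules
        "Question: " + question,
        "Answer: The negation of the original question is",
    ]
    return "\n".join(parts)


def original_answer(negation_contradicts: bool) -> str:
    """Map the contradiction check on the negated question back to the original question."""
    # Contradiction under the negated question => the original statement holds.
    return "True" if negation_contradicts else "False"


prompt = build_ir_prompt(
    facts=["Bob does not drive to work."],
    rules=[
        "If the weather is fine, Bob drives to work.",
        "If Bob does not drive to work, the weather is not fine.",
    ],
    question="The weather is not fine.",
)
print(prompt)
```

Ending the prompt mid-sentence at the negation is one way to steer the model into stating the negated question first before searching for a contradiction.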

DIR

P(A_s) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}(A_i = A_s), \quad A_i \in \{\text{True}, \text{False}, \text{Unknown}\}

•

Run both the Direct Reasoning prompt and the Indirect Reasoning prompt, then vote for the most frequent answer.

•

When there are conflicting voting results → use LLMs to determine which reasoning is more reliable and choose that answer
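A minimal sketch of the DIR voting step, computing P(A_s) as the empirical frequency defined above. Two points are assumptions made for illustration: the DR and IR answers are pooled into one list of M answers, and a "conflict" is treated as a tie for the top frequency. The LLM-based tie-break is not sketched, and the answer lists are assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, List, Optional

LABELS = ("True", "False", "Unknown")


def answer_distribution(answers: List[str]) -> Dict[str, float]:
    """P(A_s) = (1/M) * sum_i 1[A_i == A_s] over the M collected answers."""
    counts = Counter(answers)
    m = len(answers)
    return {label: counts[label] / m for label in LABELS}


def combine(dr_answers: List[str], ir_answers: List[str]) -> Optional[str]:
    """Vote over the pooled DR and IR answers; return None when the top two tie."""
    dist = answer_distribution(dr_answers + ir_answers)
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == ranked[1][1]:
        return None  # conflicting result: defer to an LLM judge (not shown here)
    return ranked[0][0]


dist = answer_distribution(["True", "Unknown", "True", "True", "False", "True"])
winner = combine(["True", "Unknown", "True"], ["True", "False", "True"])
tied = combine(["True"], ["False"])
print(dist, winner, tied)
```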

6. Empirical Study

Backbone LLM

•

GPT3.5-turbo

•

Gemini-pro

Evaluation Metric

•

accuracy of answer (AA)

◦

AA = AN / N

•

accuracy of reasoning processes (AP): the paper gives no reference or details on how this is actually measured

◦

AP = PN / N

•

overall accuracy (OA) (both the answer and the reasoning process are predicted correctly)

◦

OA = ON / N
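The three ratios can be computed from per-sample judgments as sketched below. Reading AN, PN, and ON as counts of samples with a correct answer, a correct reasoning process, and both correct is an interpretation of the formulas above. The `Judged` record is hypothetical, and its `process_correct` flag is assumed to come from some external judgment, since how AP is assessed is not specified.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Judged:
    answer_correct: bool   # final answer matches the gold label
    process_correct: bool  # reasoning process judged correct (judgment source assumed)


def accuracies(samples: List[Judged]) -> Tuple[float, float, float]:
    """Return (AA, AP, OA) = (AN / N, PN / N, ON / N) for a non-empty sample list."""
    n = len(samples)
    aa = sum(s.answer_correct for s in samples) / n
    ap = sum(s.process_correct for s in samples) / n
    oa = sum(s.answer_correct and s.process_correct for s in samples) / n
    return aa, ap, oa


samples = [
    Judged(True, True),
    Judged(True, False),
    Judged(False, True),
    Judged(True, True),
]
aa, ap, oa = accuracies(samples)
print(aa, ap, oa)
```

By construction OA can never exceed either AA or AP, which makes it the strictest of the three metrics.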

Backbone Reasoning Mechanism

•

Chain-of-Thought

•

Self-Consistency

◦

5 sampled reasoning paths

Dataset

#### Factual Reasoning

•

150 samples drawn from the ProofWriter dataset, chosen to show the usefulness of Indirect Reasoning

#### Mathematical Proving

•

ProofMath: a dataset of 35 problems solvable only by contradiction, built by the authors

→ IR amplifies the main strength of self-consistency (sampling multiple reasoning paths) while also correcting errors from accidental sampling, and it outperforms CoT

#### Hybrid dataset (+150 factual problems requiring 1 to 5 hop reasoning & +65 mathematical proof problems suited to DR)

#### Effect of Rule Augmentation (OA is the y-axis)

→ Rule Augmentation helps both DR and IR

→ Performs well on factual reasoning, where a reasonably complete set of rules is given