
SELECTION-INFERENCE: EXPLOITING LARGE LANGUAGE MODELS FOR INTERPRETABLE LOGICAL REASONING

Category: PaperReview
Venue: ICLR 2023
Backbone: LLM-7B

Connectionism vs Symbolism

[From the symbolism perspective, interpreting an 'apple' means combining a 'symbol system' predefined by experts according to some set of rules.]
→ Deep learning = connectionism, but it lacks explanatory power, so the idea of combining connectionism and symbolism gave rise to neuro-symbolic AI!

Problems with Existing Reasoning Approaches

post-hoc rationalization
: because the rationale is generated at the same time as, or after, the answer, the answer does not causally depend on the rationale
(Context: 'It is just before sunrise.' & 'The rooster crows' → R: 'because the rooster crows before the sun rises' & A: 'the rooster makes the sun rise')
Chain-of-Thought
: generates a rationale first, then generates the answer conditioned on that rationale.
(many cases where the answer is correct even though the generated rationale is wrong (see the appendices of Wei et al. (2022) for examples), and the model makes up rationales)
→ This shows that causal reasoning is not actually taking place!

Related Work

Summarizing existing reasoning approaches:
1. Produce the final answer directly (reasoning happens only implicitly)
2. Generate the reasoning explicitly in one shot (Entailment & CoT)
   a. Performs better than 1, but there is no way to tell whether causal reasoning is happening across multiple reasoning steps
   b. Often produces irrelevant or fake rationales while still getting the correct answer
3. Generate the reasoning step by step (ProofWriter & Selection-Inference)
   a. ProofWriter: can only answer prompts of the form 'Prove this statement to be True/False'
   b. Expensive, because every iteratively generated rationale has to be checked

How Well Do LLMs Reason?

→ When an LLM generates the rationale and the answer in one shot, it is (1) prone to making up rationales (2) yet often still correct in its final answer.
Before going further, let's systematically examine how well LLMs reason.
→ Dimensions to consider in a reasoning task:
the number of reasoning steps required,
presence or absence of negation,
whether the relevant context information was provided,
whether the model is required to evaluate the accuracy of multiple choices or to generate the answer, among others.
→ Experimental setting: decoder-only LLMs of various sizes in a 5-shot setting (Min et al. (2022) have demonstrated that additional shots beyond 5 result in limited increase in multi-choice accuracy), following the same protocol used for the BigBench evaluation in Rae et al. (2021), on a larger set of 46 tasks.
→ Result:
the performance of vanilla language models tends to decrease when they are presented with irrelevant facts alongside the ones relevant for reasoning (e.g. see 2WikiMultiHop With Context tasks, bAbI tasks 2-3 or ProofWriter tasks),
when they have to infer the relevant facts from memory (e.g. 2WikiMultiHop or StrategyQA tasks), and as the questions start to require more steps of reasoning (e.g. see the performance drop between bAbI tasks 1-3 or ProofWriter Tasks).
Unlike other tasks, logical tasks do not follow scaling laws cleanly.

THE SELECTION-INFERENCE (SI) FRAMEWORK

Reasoning Setting:
each question is accompanied by context information which contains all the information necessary to solve the problem (everything needed to answer is in the context), as well as potentially irrelevant distractors. In the future this assumption can be relaxed.
all questions are well posed and definitively answerable given the context.
contain mostly deductive and a small number of inductive problems.
Some problems require multiple steps of inference, where later steps use the knowledge inferred in earlier steps. (If one reasoning step goes wrong, the whole chain breaks … → problems that a purely pattern-matching model cannot solve.)

Selection-Inference

Selection Step

- Given the prompt, context, and question, extract the most relevant statements from the context
- K, the number of selection-inference steps, is a hyperparameter

Inference Step

- The few-shot setting is kept, but deduction proceeds using only the evidence picked in the Selection step
- Because the model is blocked from accessing external factors (the irrelevant context, the question), the chance of hallucination is low (see the sketch below)
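
To make the two steps concrete, here is a minimal sketch of the SI loop in Python, assuming a generic completion function llm(prompt) -> str and simplified prompt templates (both hypothetical; the paper's actual few-shot prompts are quoted in the Experimental Result section below):

# Minimal sketch of the Selection-Inference loop. `llm` stands in for any
# few-shot LLM completion function (hypothetical); the real prompts are the
# bAbI-style few-shot prompts shown below.
SELECTION_TEMPLATE = "{facts}\nQuestion: {question}\nReason:"
INFERENCE_TEMPLATE = "{selected} Therefore,"

def selection_inference(llm, context, question, num_steps=3):
    """Alternate selection and inference; each step adds one new fact."""
    facts = list(context)
    trace = []
    for _ in range(num_steps):
        # Selection: given the facts and the question, the LLM copies out
        # the most relevant statements (it composes, it does not invent).
        selected = llm(SELECTION_TEMPLATE.format(
            facts="\n".join(facts), question=question))
        # Inference: the LLM sees ONLY the selected statements -- no
        # question, no distractors -- which limits hallucination.
        new_fact = llm(INFERENCE_TEMPLATE.format(selected=selected))
        trace.append((selected, new_fact))
        facts.append(new_fact)  # later steps can build on inferred facts
    return trace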

Experimental Results

The SI framework is evaluated on a subset of 10 of the 46 logical reasoning tasks,
in the generation setting (the model must generate the answer rather than choose it).
Prompt example
The selection prompt:
""" Here are a collection of stories about people carrying objects from one room to another . You will be asked where any object is. To answer this question you need to figure out who last had the object and which room they have the object in by the end of the story. Here are some examples:
Story:
at t=0 mary grabbed the football there
at t=1 daniel got the apple there
at t=2 mary went to the kitchen
at t=3 daniel journeyed to the office
at t=4 daniel went to the bedroom
at t=5 mary moved to the garden
Question: where is the apple?
Reason: at t=1 daniel got the apple there. We know that at t=4 daniel went to the bedroom
...
at t=0 john moved to the bathroom
at t=1 john travelled to the office
at t=2 john picked up the football there
at t=3 john journeyed to the bathroom
Question: where is the football? Reason: """
The inference prompt:
""" at t=1 daniel got the apple there. We know that at t=4 daniel went to the bedroom. Therefore, the apple is in the bedroom.
... at t=2 john picked up the football there. We know at t=0 john moved to the bathroom. Therefore, """
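
To show how the two prompts hand off, here is how a single SI step would chain them (a sketch, not the authors' code; llm, selection_prompt, and inference_fewshot are hypothetical names for a completion function and the two prompt texts quoted above):

# One SI step chaining the prompts above.
selected = llm(selection_prompt)
# e.g. "at t=2 john picked up the football there. We know at t=0 john
#       moved to the bathroom"
conclusion = llm(inference_fewshot + selected + ". Therefore,")
# e.g. "the football is in the bathroom."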
Outperforms far larger models: a 7B model inside the SI framework beats a vanilla 280B baseline on these tasks.
Strong deduction performance; even after an initial selection, the model can select again in later steps.

Fine-tuning Selection-Inference

→ Selection LLM: given the context and question, trained to output "sent 2. We know that sent 4 [and sent 7]*."
This label form is used because letting the model generate sentences from scratch risks make-up & cheating (though does this really stop the model from pulling in and combining arbitrary things it already knows? …)
→ Inference LLM: infers from the selected sentences (the dependency on the selection is preserved); see the sketch below.
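
For illustration, a fine-tuning example for the Selection LLM might look like the following (a sketch assuming a plain (input, target) text-to-text format; the field names and sentence numbering are illustrative, only the target form "sent 2. We know that sent 4." comes from the paper):

# Hypothetical fine-tuning example for the Selection LLM.
example = {
    "input": (
        "sent 1: at t=0 mary grabbed the football there\n"
        "sent 2: at t=1 daniel got the apple there\n"
        "sent 3: at t=2 mary went to the kitchen\n"
        "sent 4: at t=4 daniel went to the bedroom\n"
        "Question: where is the apple?"
    ),
    # The target names sentences rather than restating them, so the model
    # can only compose existing context sentences, not make up new ones.
    "target": "sent 2. We know that sent 4.",
}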
Results
The showcased examples appear cherry-picked.