Mutual Information Alleviates Hallucinations in Abstractive Summarization

Introduction

•

Why hallucination occurs in the abstractive summarization setting

◦

The fine-tuning dataset is flawed (the target summaries have little overlap with the source to begin with)

◦

The model architecture or training procedure is flawed

◦

During generation (decoding) ← according to the authors, there has been almost no prior work on this

•

The authors' hypothesis

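(1) Hallucination occurs because the model tends to assign high probability to words that appear frequently in the training corpus, regardless of the input. (According to prior work, this tendency has also been observed in many other NLP tasks.)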

⇒ The paper's contribution is to address this by modifying the decoding strategy

Related Work

•

Why does hallucination occur?

→ Exposure bias: training-inference discrepancy ⇒ Minimum Risk Training (tentative & NMT), data augmentation, retraining (costly)

•

Can we tell when hallucination occurs?

◦

Detection based on NLI, or at the token or sentence level (post-hoc processing)

•

Decoding to avoid hallucination

◦

Approaches that impose a prior bias at decoding time to prevent hallucination

▪

e.g., forcing keywords to appear in the summary

[Constrained abstractive summarization: Preserving factual consistency with constrained generation]

▪

or making the decoder generate tokens that are similar to the source

[Focus attention: Promoting faithfulness and diversity in summarization.]

Method

•

Because the method modifies the decoding strategy, it needs a criterion for when hallucination starts

•

The hypothesis the authors make to establish this criterion

(2) Hallucination starts at the moment the model shows high uncertainty.

(because, when uncertain, the model tends to assign high logit values to tokens that appear frequently in the training data)

•

The model's uncertainty at each time step t is quantified with Shannon entropy as follows

H(p(\cdot \mid y_{<t}, x)) = -\sum_{y \in V} p(y \mid y_{<t}, x) \log p(y \mid y_{<t}, x)
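The entropy above can be computed directly from a next-token distribution. The following is a minimal sketch, not the authors' code; the two toy distributions and the 4-token vocabulary are assumptions made only for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    # Terms with p == 0 contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy distributions over a hypothetical 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform, i.e. maximal uncertainty

h_confident = shannon_entropy(confident)
h_uncertain = shannon_entropy(uncertain)  # equals log(4) for a uniform distribution
print(f"confident: {h_confident:.3f} nats, uncertain: {h_uncertain:.3f} nats")
```

Under hypothesis (2), a time step that looks like `uncertain` is where a hallucination would be expected to start.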

•

The part of the LM output that is not grounded in the source x is quantified by the marginal probability

(the part that the authors believe causes hallucination = "high logit values for tokens that appear frequently in the training data")

f_{y}(y) = \sum_{x \in X} f(x, y)

p(y | y_{<t})

•

To address hypothesis (1), the authors introduce Pointwise Mutual Information (PMI)

\mathrm{score}(y \mid x) = \log \frac{p(x, y)}{p(x)\,p(y)} \quad \text{(the score is in log space)} \\ = \log \frac{p(y \mid x)}{p(y)} \\ = \log p(y \mid x) - \log p(y)

•

Applied to the LM decoding objective (conditional − marginal):

\log p(y \mid y_{<t}, x) - \log p(y \mid y_{<t})

→ Intuitively, this expression can be read as penalizing the part of the LM output score that is not conditioned on x
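A small numeric sketch of the conditional-minus-marginal score may help. The token names and probabilities below are made up for illustration, and `pmi_scores` is a hypothetical helper, not something from the paper.

```python
import math

def pmi_scores(cond_probs, marg_probs):
    """Per-token score: log p(y | y_<t, x) - log p(y | y_<t)."""
    return {
        tok: math.log(cond_probs[tok]) - math.log(marg_probs[tok])
        for tok in cond_probs
    }

# Hypothetical next-token probabilities at one time step.
# cond: summarization model conditioned on the source x.
# marg: language model that only sees the prefix y_<t.
cond = {"the": 0.5, "Paris": 0.3, "London": 0.2}
marg = {"the": 0.7, "Paris": 0.05, "London": 0.25}

scores = pmi_scores(cond, marg)
best_by_cond = max(cond, key=cond.get)     # most likely token under the conditional model
best_by_pmi = max(scores, key=scores.get)  # token favored after subtracting the marginal
print(best_by_cond, best_by_pmi)
```

In this toy case the token that is likely mainly because it is frequent ("the") receives a negative score, while the token whose probability rises most once x is observed ("Paris") is ranked first.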

•

Using hypothesis (2), the authors treat a high entropy value at a given time step as a sign of hallucination and design the decoding strategy so that the pointwise (PMI) score function is applied only at those steps (hyperparameters λ, τ)

\mathrm{score}(y \mid y_{<t}, x) = \log p(y \mid y_{<t}, x) \\ - \lambda \cdot \mathbb{1}\{ H(p(\cdot \mid y_{<t}, x)) \geq \tau \} \cdot \log p(y \mid y_{<t})
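The entropy-gated score can be sketched for a single decoding step as follows. This is an illustrative sketch under assumptions: the probabilities, the values of `lam` and `tau`, and the helper names are hypothetical, and a real decoder would apply this inside beam search over the full vocabulary.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats); zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cpmi_scores(cond, marg, lam, tau):
    """log p(y|y_<t,x) - lam * 1{H >= tau} * log p(y|y_<t) for every token."""
    # The indicator: subtract the marginal term only when the model is uncertain.
    gate = 1.0 if entropy(cond.values()) >= tau else 0.0
    return {
        tok: math.log(cond[tok]) - lam * gate * math.log(marg[tok])
        for tok in cond
    }

# Hypothetical distributions at one uncertain time step (entropy is about 1.08 nats).
cond = {"A": 0.40, "B": 0.35, "C": 0.25}
marg = {"A": 0.80, "B": 0.10, "C": 0.10}  # "A" is likely even without the source

gated = cpmi_scores(cond, marg, lam=0.5, tau=1.0)    # entropy >= tau: penalty applied
ungated = cpmi_scores(cond, marg, lam=0.5, tau=2.0)  # entropy < tau: plain log-probability

print(max(gated, key=gated.get), max(ungated, key=ungated.get))
```

With the gate closed the scores reduce to the ordinary conditional log-probabilities, so decoding is unchanged at confident steps; with the gate open, the token favored mainly by the marginal LM loses its lead.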

Experimental Setup & Result

•

Dataset

◦

XSum

◦

500 test-set examples are annotated for hallucination

•

Model

◦

vanilla Transformer

◦

BART

◦

LM: Transformer LM

◦

Hyperparameter search: first a rough range for τ is found over a coarse set of λ values (step 1), then a search is run over 100 (λ, τ) combinations

•

Main Result Table

→ For BART, ROUGE performance is maintained while FactScore also improves (the authors note that FactCC is a faithfulness metric trained on CNN/DM, so it does not fit their setting)

•

Change in per-token score and ranking (where the highest probability token is rank 1 and lowest probability token is rank |V|) when CPMI is applied to the 500 test-set samples that carry token-level hallucination labels

→ For hallucinated tokens, the score drops (a good trend) while the ranking goes up (I don't understand this part..?); the authors evaluate this favorably

→ The effect on non-hallucinated tokens can be controlled with the threshold
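To make the rank convention concrete (rank 1 = highest score, rank |V| = lowest), here is a small sketch. The score values are hypothetical, and reading "ranking goes up" as "the rank number increases" is an assumption on my part, not something these notes confirm.

```python
def token_rank(scores, token):
    """1-based rank of `token`; the highest-scoring token has rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(token) + 1

# Hypothetical scores for a 3-token vocabulary at one time step.
before = {"A": -0.92, "B": -1.05, "C": -1.39}  # plain conditional log-probabilities
after = {"A": -0.80, "B": 0.10, "C": -0.24}    # scores after a CPMI-style adjustment

rank_before = token_rank(before, "A")  # 1: "A" is the top token
rank_after = token_rank(after, "A")    # 3: "A" is now the least preferred token
print(rank_before, rank_after)
```

Under this convention, a token that becomes less preferred relative to the others moves to a larger rank number, which would be consistent with the hallucinated tokens being less likely to be generated.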