Fine-grained Hallucination Detection and Editing for Language Models

1. Introduction

•

(Figure, right) Existing hallucination detection work:

◦

classifies hallucinations with a simple binary distinction, such as factual or not

◦

treats them as entity-level errors

•

(Figure, left)

◦

An entity error that concerns a simple fact can be corrected with a single reference

◦

When a fabricated entity is involved, verification against multiple references is needed

⇒ To address this, the paper proposes a new system that precisely identifies hallucination spans, classifies them into types according to a predefined taxonomy, and suggests refinements.

2. Related Work

Hallucination in NLG

•

Work on summarization, text simplification, and knowledge-intensive conversation generally assumes that a source text exists and measures how faithful the LM output is to that source text.

⇒ This line of work is outdated in the LLM era, where models generate from world knowledge rather than from a given source text.

Detecting and editing hallucinations in LMs.

•

Binary classification of whether an LM output (called a "statement" in the paper) is factual or not (FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs, arXiv; FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, ACL Anthology)

•

Entity-level hallucination editing (PURR: Efficiently Editing Language Model Hallucinations by..., arXiv)

•

Limited to detection only (The Dawn After the Dark: An Empirical Study on Factuality..., arXiv)

⇒ Span-level hallucination detection has not been studied much, so the paper proposes a fine-grained taxonomy for it.

Fact verification for human-written claims.

•

A related line of research develops fact verification systems for human-written claims (Wikipedia, scientific documents, news).

⇒ That work verifies human-written claims, not LM-generated text.

3. Fine-grained Hallucination Detection

•

query: requires factual knowledge

•

hallucination: (1) factual errors or (2) unverified statements, i.e. content that cannot be verified against world knowledge

3.1 Hallucination Taxonomy

→ The taxonomy is defined by adding complexity to prior work.

1 - statements that contradict world knowledge

2 - unverifiable statements

(1a) Entity : Contradictory entity errors are a sub-category within Type 1, where an entity in a statement is incorrect and changing that single entity can make the entire sentence factually correct.

(1b) Relation : Contradictory relation errors are another sub-category within contradictory statements where a semantic relationship (e.g., verbs, prepositions, or adjectives) in a statement is incorrect.

(1c) Sentence : Contradictory sentence errors refer to cases where a full statement entirely contradicts relevant evidence from the web, and cannot be solved via phrase-level edits.

(2) Invented : Invented errors refer to statements where the LM generates an entirely fabricated entity (i.e., a fictitious entity) that doesn’t exist based on world knowledge. Fictional entities in creative work aren’t included.

(3) Subjective : Subjective errors refer to expressions about existing entities that lack universal validity (subjective opinions about an entity). These statements often do not contain facts and are influenced by personal beliefs or opinions.

(4) Unverifiable : These are statements where the LM output contains facts, but no retrieved evidence from the web can directly support or contradict the fact (the evidence retrieved from the web does not support it) (e.g., private details).

•

Entity and Relation errors can be corrected by editing; the remaining types have to be deleted from the output.
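The edit-versus-delete rule above can be written down as a small lookup. This is only a sketch of the notes' taxonomy: the names `ErrorType`, `EDITABLE`, and `action_for` are hypothetical and do not come from the paper.

```python
from enum import Enum


class ErrorType(Enum):
    """The six hallucination types from Section 3.1 (values are the note labels)."""
    ENTITY = "1a"
    RELATION = "1b"
    SENTENCE = "1c"
    INVENTED = "2"
    SUBJECTIVE = "3"
    UNVERIFIABLE = "4"


# Per the notes: only Entity and Relation errors can be fixed by a phrase-level edit.
EDITABLE = frozenset({ErrorType.ENTITY, ErrorType.RELATION})


def action_for(error_type: ErrorType) -> str:
    """Return "edit" for correctable types and "delete" for everything else."""
    return "edit" if error_type in EDITABLE else "delete"


if __name__ == "__main__":
    for t in ErrorType:
        print(f"({t.value}) {t.name.title()}: {action_for(t)}")
```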

3.2 Tasks and Metrics

•

input: $x$

•

model output: $y$

•

error type $t$ exists in a sentence of $y$ ($s_i \in y$)

◦

$e = (e^{text}, e^{type})$

◦

$e^{*t}_{i} \in \{TRUE, FALSE\}$

→ A label for whether the $i$-th sentence contains an error of type $t$.

(Question) Does this assume that each sentence contains only one hallucination span?

•

Hallucination Detection Metric.

•

Hallucination editing Metric.

→ $f$: FActScore
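For the detection metric, the sentence-level labels $e^{*t}_{i}$ lend themselves to per-type precision, recall, and F1. The exact formulas are not preserved in these notes, so the sketch below assumes the standard definitions; `detection_scores` is a hypothetical helper name.

```python
from typing import Dict, List


def detection_scores(pred: List[bool], gold: List[bool]) -> Dict[str, float]:
    """Precision/recall/F1 for one error type t.

    pred[i] is the model's answer to "does sentence i contain a type-t error?",
    gold[i] is the annotated label for the same sentence.
    Assumes the standard definitions; returns 0.0 when a ratio is undefined.
    """
    if len(pred) != len(gold):
        raise ValueError("pred and gold must cover the same sentences")
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Four sentences, one error type: one hit, one false alarm, one miss.
    print(detection_scores([True, True, False, False], [True, False, True, False]))
```

Running the function once per error type and averaging the F1 values would give a single fine-grained score, if an aggregate is wanted.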

4. Benchmark: FAVABENCH

•

Collect 50 prompts from each of the sources below (information-seeking queries where possible)

•

Generate a response to each prompt with ChatGPT, Llama2-Chat 7B, and Llama2-Chat 70B (600 in total)

•

Annotate as shown below

◦

75.1% agreement in detection at the sentence level and 60.3% agreement in exact error type detection.

→ For entity and relation errors, annotators were also asked to provide the edited result.

•

Error type distribution of the model outputs

→ Entity errors are the most frequent.

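→ Outputs for prompts from the No Robots dataset contain fewer errors, presumably because those prompts are rarely knowledge-intensive or because many of the queries are already covered by the LM's parametric knowledge.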

5. Model: FAVA

•

Retrieve Relevant Documents

◦

$C = M_{ret}(x, y).$

•

If possible, edit factual errors in $y$ given the retrieved context

◦

$\hat{y} = M_{edit}(x, y, C).$

•

$M_{edit}$:

◦

state-of-the-art proprietary LM → prompting → unstable

◦

open-source LM → needs training data
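The two stages above compose into a retrieve-then-edit pipeline. The sketch below only illustrates that composition: `detect_and_edit`, `toy_retriever`, and `toy_editor` are hypothetical stand-ins, not the paper's retriever or editing model.

```python
from typing import Callable, List

# M_ret maps (x, y) to a list of documents C; M_edit maps (x, y, C) to an edited output.
Retriever = Callable[[str, str], List[str]]
Editor = Callable[[str, str, List[str]], str]


def detect_and_edit(x: str, y: str, m_ret: Retriever, m_edit: Editor) -> str:
    """Compute C = M_ret(x, y), then y_hat = M_edit(x, y, C)."""
    context = m_ret(x, y)
    return m_edit(x, y, context)


def toy_retriever(x: str, y: str) -> List[str]:
    """Stand-in for M_ret: always returns one fixed evidence sentence."""
    return ["Paris is the capital of France."]


def toy_editor(x: str, y: str, context: List[str]) -> str:
    """Stand-in for M_edit: fixes one hard-coded entity error if evidence supports it."""
    if any("capital of France" in doc for doc in context):
        return y.replace("Lyon", "Paris")
    return y


if __name__ == "__main__":
    query = "What is the capital of France?"
    output = "Lyon is the capital of France."
    print(detect_and_edit(query, output, toy_retriever, toy_editor))
```

Keeping the retriever and editor as plain callables makes it easy to swap a prompted proprietary LM for a fine-tuned open-source one without touching the pipeline.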

5.1 Synthetic Training Data Curation

#### Seed passage generation.

•

Sample 35,074 articles $c$ from NQ + Wikipedia, then paraphrase them with ChatGPT

→ adaptable and effective across various textual formats.

•

$c \rightarrow t$

#### Error insertion.

•

Use GPT-4 or ChatGPT with in-context learning (ICL) to insert errors into the paraphrased article, one error type at a time

•

$t \rightarrow y$

#### Post-processing.

•

Remove the error tags and the clean (correct) phrases to produce the FAVA input

•

$y \rightarrow y$

•

$t \rightarrow y^*$ (the paraphrased input serves as the ground-truth output)

•

Use Contriever to build the context: $c \ \cup \{retrieved\} = C$
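The tag-stripping step in the post-processing above can be sketched with regular expressions. The notes do not record the tag format, so the markup below is an assumption made purely for illustration: each error is wrapped in a type tag, with the inserted wrong phrase in `<delete>` and the original clean phrase in `<mark>`. The function names are hypothetical as well.

```python
import re

# ASSUMED tag names, for illustration only; the notes do not specify the format.
ERROR_TAGS = ("entity", "relation", "contradictory", "invented", "subjective", "unverifiable")
_TYPE_TAG = re.compile(r"</?(?:%s)>" % "|".join(ERROR_TAGS))
_MARK = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)      # clean phrase
_DELETE = re.compile(r"<delete>(.*?)</delete>", re.DOTALL)  # erroneous phrase


def _tidy(text: str) -> str:
    """Collapse whitespace left behind after removing spans."""
    return re.sub(r"\s+", " ", text).strip()


def to_editor_input(tagged: str) -> str:
    """Drop the tags and the clean phrases, keep the errors (the editor's input)."""
    text = _MARK.sub("", tagged)
    text = _DELETE.sub(r"\1", text)
    return _tidy(_TYPE_TAG.sub("", text))


def to_clean_text(tagged: str) -> str:
    """Drop the tags and the erroneous phrases, keep the clean phrases."""
    text = _DELETE.sub("", tagged)
    text = _MARK.sub(r"\1", text)
    return _tidy(_TYPE_TAG.sub("", text))


if __name__ == "__main__":
    tagged = ("Paris is the capital of "
              "<entity><delete>Germany</delete><mark>France</mark></entity>.")
    print(to_editor_input(tagged))
    print(to_clean_text(tagged))
```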

Training and Inference

#### Training.

•

$-\log M_{edit}(y^* \mid C, y).$

•

$M_{edit}$: Llama2-Chat 7B (the chat model works better)
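For an autoregressive LM, the training objective above decomposes into a sum of per-token negative log-probabilities of $y^*$, each conditioned on $C$, $y$, and the preceding target tokens. A minimal sketch of that arithmetic, assuming the per-token probabilities are already available (`sequence_nll` is a hypothetical helper name):

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a target sequence.

    token_probs[k] is the model's probability of the k-th target token of y*,
    given C, y, and the earlier target tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


if __name__ == "__main__":
    # Two target tokens with probabilities 0.5 and 0.25:
    # -(log 0.5 + log 0.25) = log 8
    print(sequence_nll([0.5, 0.25]))
```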

#### Inference.

•

Retrieve the 5 most relevant Wikipedia documents with Contriever-MSMARCO
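Top-5 retrieval with a dense retriever amounts to ranking documents by embedding similarity and keeping the best five. The sketch below assumes relevance is scored by an inner product between a query embedding and document embeddings, as dense retrievers commonly do; the vectors are toy values and `top_k` is a hypothetical helper, not the Contriever API.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 5) -> List[int]:
    """Return the indices of the k highest-scoring documents, best first.

    Ties are broken by the lower document index so the result is deterministic.
    """
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
    print(top_k(query, docs, k=2))
```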

6. Experiments

•

test set: 902 annotated passages

•

baselines

◦

ChatGPT

◦

GPT4

◦

Rt+ChatGPT (augments the original prompt at test time with the same top five documents retrieved by Contriever that FAVA uses)

⇒ Detection performance on Sentence, Subjective, and Entity errors is better than that of the other LLMs.

⇒ (Although F1 is higher) the paper describes performance on Invented and Unverifiable as still limited. Presumably this is because fully verifying these two types would require more error types.

⇒ Evaluating binary rather than fine-grained classification with FActScore also shows improved performance, as expected.

⇒ When the edited outputs are evaluated, ChatGPT shows no gain (it has no additional information), whereas FAVA improves FActScore after editing.

⇒ Performance increases as more effort goes into retrieval (top-1 → top-5 → reranking), and it peaks when lexical matching is used, as in top-4 + ‘introductory paragraph of the target entity’.

7. Conclusions

•

Proposes an automatic hallucination detection task

•

Proposes a taxonomy that classifies hallucinations hierarchically