Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

1. Limitations of Previous Automatic Metrics

→ ROUGE, BLEU, and METEOR are insufficient to measure the factual correctness of summaries and fail to correlate with human judgements of factuality (Falke et al., 2019; Kryscinski et al., 2019).

→ Judging a summary as consistent / non-consistent with the source through binary classification, without taking the summary's length into account, has limits. (With binary labels, crowd-expert agreement is low: because no clear criteria are given, annotators' subjectivity enters the human evaluation.)

→ Not all factual errors carry the same weight, and the number of errors in a summary can affect its perceived factuality. This is the logical background for multi-class labeling.

→ Classifying errors in a fine-grained way helps improve binary classification performance.
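
The first point above can be illustrated with a minimal sketch. The function below is a simplified unigram-overlap F1 in the spirit of ROUGE-1, not the official ROUGE implementation, and the two sentences are invented for illustration. Swapping a single predicate reverses the meaning of the summary, yet the overlap score stays high.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (ROUGE-1-like; illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: the candidate contradicts the reference.
reference = "the company hired fifty workers in march"
candidate = "the company fired fifty workers in march"
score = unigram_f1(candidate, reference)
print(round(score, 3))  # 6 of 7 tokens match -> 0.857
```

Six of the seven tokens overlap, so the score is about 0.857 even though the summary states the opposite of the reference.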

2. Typology of Factual Errors

→ A summary consists of multiple sentences.

→ Discourse markers describe relations across propositions → causality and temporal ordering can introduce inconsistencies with the article.

#### Semantic Frame Errors (one specific part of the summary is wrong, causing an inconsistency, because the model failed to learn the meaning of each individual source sentence)

A semantic frame is a schematic representation of an event, relation, or state, which consists of a predicate and a list of participants, called frame elements.

• Predicate Error (PredE): Category PredE encompasses errors where the predicate in a summary statement is inconsistent with the source text. (If the predicate is wrong, the meaning itself is wrong.)

• Entity Error (EntE): Category EntE captures errors where the primary arguments (like entities) of the predicate are wrong or have the wrong attributes, although the relation was expressed in the original text. (The subject/object of the predicate differs from the source.)

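• Circumstance Error (CircE): Category CircE captures errors where one or more circumstantial attributes of the predicate (non-core frame elements within a frame) are wrong. (The periphery of the predicate, such as location or time, differs from the source.)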

#### Discourse Errors (erroneous links between discourse segments: inconsistencies arise because sentences are linked incorrectly, as the model failed to learn the relations among multiple source sentences)

• Coreference Error (CorefE): Category CorefE accounts for errors where pronouns and other types of references to previously mentioned entities either are incorrect or have no clear antecedents, making them ambiguous. (A pronoun or reference is incorrect, or refers to an antecedent that does not exist.)

• Discourse Link Error (LinkE): Category LinkE encompasses errors involving a discourse link between different statements. These include errors of incorrect temporal ordering or incorrect discourse links (e.g. RST relations, discourse connectors) between statements. (The temporal or causal relations between sentences are stated incorrectly.)

#### Content Verifiability Errors (summary cannot be verified against the source text due to difficulty in aligning them to the source)

• Out of Article Error (OutE): Since summaries of a document should only contain information that can be deduced from the original text, we include a category for such errors, OutE (prior work refers to this as extrinsic hallucinations (Maynez et al., 2020)).

• Grammatical Error (GramE): We use GramE to categorize statements that are not well formed. When grammatical mistakes make the meaning of a statement incomprehensible or ambiguous, it cannot be verified against the source and is thus considered trivially wrong. (A grammatically broken sentence cannot be verified against the source.)

#### Others (OthE)

#### Not an Error (NE)
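
The typology above can be sketched as a small data structure. The names `ErrorType`, `CATEGORY`, `is_factual`, and `errors_by_category` are hypothetical, introduced here for illustration only. The rule that a summary counts as factual only when every sentence is labeled NE is also an assumption of this sketch, not something stated in these notes.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    PRED_E = "PredE"
    ENT_E = "EntE"
    CIRC_E = "CircE"
    COREF_E = "CorefE"
    LINK_E = "LinkE"
    OUT_E = "OutE"
    GRAM_E = "GramE"
    OTH_E = "OthE"
    NE = "NE"


# Grouping of fine-grained labels into the three categories listed above.
# OthE and NE belong to none of them and are left out on purpose.
CATEGORY = {
    ErrorType.PRED_E: "Semantic Frame",
    ErrorType.ENT_E: "Semantic Frame",
    ErrorType.CIRC_E: "Semantic Frame",
    ErrorType.COREF_E: "Discourse",
    ErrorType.LINK_E: "Discourse",
    ErrorType.OUT_E: "Content Verifiability",
    ErrorType.GRAM_E: "Content Verifiability",
}


def is_factual(sentence_labels):
    """Binary collapse (assumed rule): factual only if every label is NE."""
    return all(label is ErrorType.NE for label in sentence_labels)


def errors_by_category(sentence_labels):
    """Count how many sentence labels fall into each error category."""
    counts = Counter(
        CATEGORY[label] for label in sentence_labels if label in CATEGORY
    )
    return dict(counts)


# Hypothetical annotation of a four-sentence summary.
labels = [ErrorType.NE, ErrorType.ENT_E, ErrorType.LINK_E, ErrorType.ENT_E]
print(is_factual(labels))
print(errors_by_category(labels))
```

Keeping the fine-grained label and deriving the binary label from it means the binary view can always be recomputed, while the reverse is not possible.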

3. Dataset Creation

#### Human evaluation was used to measure how often models fine-tuned on CNN/DM and XSum produce the error types defined in the typology above.

→ CNN/DM: 5 models × 250 outputs

→ XSum: 4 models × 200 outputs

4. Summarization Model Analysis

→ About 60% of the summaries contain at least one factual error.

→ Models trained on the XSum dataset show more factual errors than models trained on CNN/DM.

→ The differences in error distribution across models are omitted here because the models are too dated.
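
A statistic such as "share of summaries with at least one factual error" can be computed from sentence-level labels roughly as follows. This is a minimal sketch: the function name `share_with_error` is hypothetical, and the toy annotations are invented and do not reproduce the paper's figures.

```python
def share_with_error(summaries):
    """Fraction of summaries containing at least one non-NE sentence label."""
    if not summaries:
        return 0.0
    flagged = sum(
        any(label != "NE" for label in labels) for labels in summaries
    )
    return flagged / len(summaries)


# Hypothetical annotations: one list of sentence labels per summary.
toy = {
    "CNN/DM": [["NE", "NE"], ["NE", "EntE"], ["NE"], ["NE", "NE", "NE"]],
    "XSum": [["OutE"], ["NE"], ["PredE"], ["OutE"]],
}
rates = {name: share_with_error(items) for name, items in toy.items()}
print(rates)  # {'CNN/DM': 0.25, 'XSum': 0.75} for this toy data
```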

5. Factuality Metric Evaluation

→ Using the human-evaluated dataset above, the metrics were scored by comparison with the reference summaries (the exact procedure is unclear to me).

→ Automatic metrics do not capture the errors in the human-evaluated dataset well.

→ ROUGE-L is comparatively good at catching extrinsic hallucinations (OutE).

→ OpenIE is more correlated with semantic frame errors (one specific part of a sentence is wrong).

→ BERTScore: semantic frame errors (one specific part of a sentence is wrong).

→ QAGS fails to catch coreference errors and discourse errors.

→ FEQA fails to catch semantic frame errors and content verifiability errors (summary cannot be verified against the source text due to difficulty in aligning them to the source).

→ Entailment metrics are fairly good at catching semantic frame and content verifiability errors (presumably because the task itself compares sentences one pair at a time, checking the meaning word by word).

→ DAE catches discourse errors well.

→ FactCC does not catch discourse errors well.
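
One common way to check whether a metric "catches" what annotators see is to correlate its scores with human factuality scores. The sketch below uses a plain Pearson correlation, which is an assumption made for illustration, since these notes do not record the paper's exact scoring procedure. The function name `pearson` and all numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-summary scores: human factuality vs. one automatic metric.
human = [1.0, 0.5, 0.0, 1.0, 0.5]
metric = [0.9, 0.6, 0.3, 0.8, 0.4]
r = pearson(human, metric)
print(f"correlation with human judgements: {r:.2f}")
```

Repeating this on subsets of summaries restricted to one error category would give a rough per-category view of the kind summarized in the list above.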