Rethinking Interpretability in the Era of Large Language Models

1. Introduction

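Interpretability

Interpreting model weights/features with respect to a 'given output'

•

The biggest problem with LLMs

◦

Low or effectively nonexistent interpretability → application in high-stakes domains (e.g., medical) is not possible

→ need to elicit trustworthy interpretations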

•

Through natural language generation, LLMs can provide more elaborate explanations than existing interpretable ML techniques

◦

Combining rationale queries such as 'Can you explain your logic?', 'Why didn't you answer with (A)?', or 'Explain this data to me.' with the LLM's ability to ground and process data makes it possible to express previously unintelligible model behavior and data patterns directly as human-readable text.

(limitations: hallucination and cost remain open problems)

2. Background and Definitions of Interpretability

Previous ML Interpretability

•

Because no clear definition had been established, prior work took a narrow view focused on feature attribution, saliency maps, and transparent models

→ LLMs have a broader scope

LLM Interpretability

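(Note: the traditional ML view interpreted models in terms of weights or features; since this is about LLMs, the intent seems to be to look at it from a knowledge (KG) perspective instead)

: extraction of relevant knowledge from an LLM concerning relationships either contained in data or learned by the model.

(= extracting from the LLM knowledge related to the data, or to what the model has learned)

•

What this definition can cover

→ It applies to interpreting the LLM itself (parametric KG)

→ It also applies to using an LLM to generate explanations (explainability)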

LLM

LMs with 10B+ parameters, trained on massive pre-training (PT) data

(e.g., PaLM, LLaMA, and GPT-4)

Evaluating LLM interpretations (what is evaluated here is the explanation)

•

Human Evaluation

: In a real-world setting, when humans are given the interpretation, does the outcome move toward the desired output? (this seems to treat interpretation as a rationale)

•

Automatic Evaluation

: Use the LLM itself for scoring (beware of an LM systematically scoring its own outputs too positively)

→ To address this, the paper suggests using the LLM 'as part of a structured evaluation process tailored to the specific problem'

•

Alter/improve model performance in useful ways

: Improvement in model performance (useful explanation → improving accuracy at downstream tasks)

: Does the explanation reduce the shortcuts/spurious correlations the LLM picked up during pre-training?

3. Unique opportunities and challenges of LLM interpretation

Unique opportunities

natural-language interface

•

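Explain complex patterns through a natural-language interface

•

Easy to connect to diverse modalities (genes, images) that humans may find hard to interpret on their own (natural language can convey varying levels of detail and counterfactuals)

interactive explanations (humans can step in midway to interpret, or to build extended explanations)

•

Users can tailor explanations to their own particular needs

(e.g., run an analysis on a single query sample, or ask follow-up questions)

•

A single LLM explanation call can be split into [reason sub 1, reason sub 2] and decomposed into separate LLM calls, so each part can be audited independently → provides a basis for the analysis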
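The decomposition idea in the last bullet can be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence-based splitting rule, the function names, and `toy_verifier` (a stand-in for a second, independent LLM call) are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple


def split_explanation(explanation: str) -> List[str]:
    """Split an explanation into sub-reasons, one per sentence (naive rule)."""
    parts = [p.strip() for p in explanation.replace("\n", " ").split(".")]
    return [p for p in parts if p]


def audit_explanation(
    explanation: str, verify: Callable[[str], bool]
) -> List[Tuple[str, bool]]:
    """Check every sub-reason independently and report a verdict for each."""
    return [(reason, verify(reason)) for reason in split_explanation(explanation)]


def toy_verifier(reason: str) -> bool:
    """Hypothetical stand-in for a separate LLM call that audits one sub-reason."""
    known_facts = {"the review contains the word 'refund'"}
    return reason.lower() in known_facts


if __name__ == "__main__":
    text = (
        "The review contains the word 'refund'. "
        "The customer is a long-time subscriber."
    )
    for reason, ok in audit_explanation(text, toy_verifier):
        print(("supported  " if ok else "unsupported"), "|", reason)
```

Because each sub-reason gets its own verdict, a reader can see which part of the explanation failed the audit instead of accepting or rejecting it as a whole.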

Unique challenges

hallucinations

•

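Incorrect or baseless explanations (whether the basis was parameterized from the pre-training corpus does not matter; what matters in the end is whether the explanation is accurate and grounded)

•

Techniques for identifying and countering hallucinations are critical to the success of LLM interpretability.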

immensity and opaqueness

•

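Models contain billions of parameters and keep growing. This makes it hard for humans to directly inspect or understand an LLM at the level of its individual units.

•

Generating even a single token incurs substantial computational cost (and real monetary cost for proprietary models), so efficient algorithms for interpretation are needed.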

4. Explaining an LLM

(How can an LLM be explained?)

Local explanation (one generation: the line of work that aims to explain the LLM through a single generated output)

•

Explaining a single generation (a single output) from an LLM

◦

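Feature attribution analysis over input tokens using the LLM (transformer) (a variety of attribution methods can be applied)

◦

Attention maps also provide explanations, although their faithfulness/effectiveness is unclear

◦

Directly generate a single explanation for a single generation (clarifies the prediction and can express nuances such as uncertainty, but may yield hallucinated explanations)

◦

With CoT, the LLM can be made to produce the explanation and the generation in sequence, one leading into the other (conveys the LLM's reasoning process to the user and can force the LLM to follow reasoning steps) [the paper presents this as a way to reduce hallucination]

◦

RAG: the evidence the LLM uses when making a decision is easier to explain (because it was explicitly retrieved via text embeddings)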
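As a concrete instance of the input-token feature attribution mentioned in the first sub-bullet, here is a minimal occlusion (leave-one-out) sketch. Occlusion is just one of the many attribution methods the notes allude to. `toy_sentiment_score` is a hypothetical stand-in for a real model score (for example, the log-probability an LLM assigns to its generated answer), and the `[MASK]` placeholder is likewise an assumption.

```python
from typing import Callable, List, Tuple


def occlusion_attribution(
    tokens: List[str],
    score_fn: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Attribute the score to each token as the drop caused by masking it."""
    base = score_fn(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append((tok, base - score_fn(occluded)))
    return attributions


def toy_sentiment_score(tokens: List[str]) -> float:
    """Hypothetical scorer standing in for a model output; not a real LLM."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    return sum(weights.get(t.lower(), 0.0) for t in tokens)


if __name__ == "__main__":
    sentence = ["The", "movie", "was", "great", "but", "boring"]
    for token, value in occlusion_attribution(sentence, toy_sentiment_score):
        print(f"{token:>8}  {value:+.1f}")
```

With a real LLM the scorer would require one forward pass per occluded input, which is where the cost concern from Section 3 shows up in practice.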

Global and mechanistic explanation (the line of work that aims to explain how the LLM works as a whole)

•

Explaining the LLM as a whole (explanation methods for understanding the LLM holistically)

(audit the model for concerns beyond generalization, such as bias, privacy, and safety)

•

Probing techniques → (decoding embedded information: does the model capture the syntax within a sentence? & testing the model's capabilities on precisely designed tasks: subject-verb agreement)

•

Probing includes the analysis of attention heads, embeddings, and different controllable aspects of representations.

•

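What roles do MLP layers and attention heads play?

◦

How do groups of attention heads work together to perform a specific task?

◦

Do MLP layers localize factual knowledge? (Rather than describing a complete circuit, the finding is that an MLP layer at a particular layer and position activates and influences the generation of an answer about a specific piece of knowledge (KG).)

→ Still hard to scale these studies up. (Some work analyzes a very small transformer statistically → then tries applying scaling laws)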

•

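When the LLM's training data distribution is available → try to explain model behaviors through the presence of long-tail/repeated data

•

All of the interpretability approaches above can be further improved through 'LLM-based interactivity'

◦

In a chat format, keep eliciting rationales for why the model gave a particular answer, and improve on that basis.

◦

Make the model ask itself why it returned that output

→ Insights gained from the above can be applied in the following areas:

model editing, improving instruction following, and model compression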
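The probing bullets above (decoding information embedded in representations) can be illustrated with a toy linear probe. This sketch makes several assumptions that are not in the paper: the "embeddings" are synthetic vectors rather than real LLM hidden states, the binary property is planted in one known dimension, and a mean-difference classifier is swapped in for the logistic-regression probes more commonly used, purely to keep the example dependency-free.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_difference_probe(
    xs: Sequence[Vector], ys: Sequence[int]
) -> Tuple[Vector, float]:
    """Linear probe: w points from the class-0 mean to the class-1 mean."""
    pos = mean([x for x, y in zip(xs, ys) if y == 1])
    neg = mean([x for x, y in zip(xs, ys) if y == 0])
    w = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def accuracy(w: Vector, b: float, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    return sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(ys)


def make_synthetic_embeddings(
    n: int, dim: int = 8, signal_dim: int = 2, seed: int = 0
) -> Tuple[List[Vector], List[int]]:
    """Synthetic stand-in for hidden states: the label is planted in one dimension."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.uniform(-0.3, 0.3) for _ in range(dim)]
        x[signal_dim] += 1.0 if y == 1 else -1.0
        xs.append(x)
        ys.append(y)
    return xs, ys


if __name__ == "__main__":
    train_x, train_y = make_synthetic_embeddings(200, seed=0)
    test_x, test_y = make_synthetic_embeddings(100, seed=1)
    w, b = fit_mean_difference_probe(train_x, train_y)
    print("held-out probe accuracy:", accuracy(w, b, test_x, test_y))
    print("largest-weight dimension:", max(range(len(w)), key=lambda i: abs(w[i])))
```

High held-out accuracy would suggest the property is linearly decodable from the representation, and the dominant weight hints at where it is encoded. Whether the model actually uses that information is a separate question that probing alone does not answer.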

5. Explaining a dataset

(LLM interpretability viewed through LLMs that explain a dataset.)

Tabular Data

•

LLMs can contribute to dataset explanation by executing code directly to visualize and analyze the dataset (ChatGPT Code Interpreter)

•

LLMs can analyze classical ML-based interpretable models fitted to tabular data to obtain explanations of the dataset

Text Data

•

Build fully interpretable models (linear models, decision trees) with the help of LLMs

◦

interpretable model (which features (i.e., words or n-grams) are important for predicting different outcomes)

•

Build partially interpretable models with the help of LLMs

◦

Use chains of prompts: prompt the model to generate a 'single tree of explanations' whose features hold across all examples in the dataset / induce LLM self-verification through a single chain of prompts

→ Prompting methods are needed to prevent hallucination
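To make the "fully interpretable model" bullet concrete, the sketch below fits a tiny linear model over unigrams (smoothed log-odds weights, as in naive Bayes) so that the important words can be read directly off the weights. It shows only the interpretable end product: the role an LLM would play in building such a model is not modeled here, and the toy texts, labels, and function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def fit_log_odds(
    texts: Sequence[str], labels: Sequence[int], alpha: float = 1.0
) -> Dict[str, float]:
    """One weight per word: smoothed log-odds of appearing in class 1 vs class 0."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label == 1 else neg).update(text.lower().split())
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        w: math.log((pos[w] + alpha) / pos_total)
        - math.log((neg[w] + alpha) / neg_total)
        for w in vocab
    }


def predict(weights: Dict[str, float], text: str) -> int:
    """Linear decision rule: sum the weights of the words that are present."""
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return 1 if score > 0 else 0


def top_features(weights: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """The k words with the largest absolute weight."""
    return sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


if __name__ == "__main__":
    texts = ["great plot and great acting", "boring plot", "great fun", "boring and slow"]
    labels = [1, 0, 1, 0]
    weights = fit_log_odds(texts, labels)
    for word, weight in top_features(weights, k=4):
        print(f"{word:>8}  {weight:+.2f}")
```

Every prediction is a sum of per-word weights, so the explanation of the dataset ("which words drive which outcome") is the model itself rather than a post-hoc approximation.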

6. Future research priorities

Research priorities for LLM interpretation proposed in the paper

Explanation reliability

•

Are the explanations provided by the LLM reliable? (hallucination)

◦

explanation does not entail the model's predictions (shortcut)

◦

explanation not factually grounded in the input
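The second failure mode (an explanation that is not grounded in the input) can be partially checked automatically. The sketch below assumes, hypothetically, that the explanation cites its evidence in double quotes, and it only tests whether each quoted span literally occurs in the input. It says nothing about the first failure mode, i.e. whether the explanation actually entails the model's prediction.

```python
import re
from typing import List, Tuple


def quoted_spans(explanation: str) -> List[str]:
    """Extract the evidence spans the explanation cites in double quotes."""
    return re.findall(r'"([^"]+)"', explanation)


def check_grounding(input_text: str, explanation: str) -> List[Tuple[str, bool]]:
    """For each cited span, report whether it occurs in the input (case-insensitive)."""
    haystack = " ".join(input_text.lower().split())
    return [
        (span, " ".join(span.lower().split()) in haystack)
        for span in quoted_spans(explanation)
    ]


if __name__ == "__main__":
    source = "The patient reports chest pain after exercise."
    explanation = 'Flagged as risky because of "chest pain" and "family history".'
    for span, grounded in check_grounding(source, explanation):
        print(("grounded  " if grounded else "ungrounded"), "|", span)
```

A literal-match check like this is deliberately conservative: paraphrased evidence would be reported as ungrounded, so it works better as a first-pass filter than as a final verdict.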

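Dataset explanation for knowledge discovery (data mining)

•

Dataset explanation → generate and discover new knowledge from data

◦

Can brainstorm scientific hypotheses that human researchers can then screen or test

(feed a dataset to the LLM and pull out a few hypotheses)

◦

The paper cites related work that uses LLMs to understand data in opaque domains such as chemical compounds or DNA sequences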

Interactive explanations

•

Provide explanations through an interactive interface built on LLM explanations and follow-up questions

•

A dialog format makes it possible to check, for example, whether the LLM provides good explanations