SmolLM2: When Smol Goes Big - Data-Centric Training of a Small Language Model

1. Introduction

•

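LMs with 10 billion or more parameters inevitably run into usage constraints because of their size, so development of 1-3B models has naturally drawn attention

•

The key to success for 1-3B models comes down to which data they are trained on (data curation)

◦

Because the parameter count is small, data filtering aimed at ‘learning core knowledge and fundamental capabilities’ matters all the more for generalization.

•

The paper aims to maximize performance by rigorously curating data from diverse web sources and training a small LM in multiple stages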

2. Backgrounds

•

Pretraining

◦

objective: fit the structure of language and store factual knowledge

◦

key: composition of the pretraining dataset 

▪

webtext (filtering → reformatting → dedup)

▪

small specialized datasets (e.g., code, math) (high-quality datasets are incorporated later in training)

•

Instruction-Tuning

•

Preference Learning

3. Pretraining datasets

RQ. How does performance change each time the data mixture is changed? (verified empirically)

•

LM Architecture

◦

1.7B parameter Transformers

◦

sequence length of 2048

◦

2 ablation setups

▪

web only (CC): trained on 350B tokens randomly sampled from the full dataset

▪

code/math: starting from a checkpoint trained on 3T tokens (described later), then trained on code (200B code tokens) or math (60B math + 40B web tokens)

⇒ apply annealing
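A minimal sketch of what annealing means for these ablations, assuming it refers to continuing from the checkpoint while the learning rate decays linearly to zero over the ablation budget. The decay shape, the function name, and the peak value are assumptions for illustration; these notes do not give them.

```python
def annealed_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Linearly decay the learning rate from peak_lr (step 0) to 0 (last step)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps do not produce
    # negative or above-peak learning rates.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * (1.0 - progress)


PEAK_LR = 5e-4      # hypothetical value, not taken from the paper
TOTAL_STEPS = 1000  # hypothetical length of the annealing run

for step in (0, 250, 500, 750, 1000):
    print(step, annealed_lr(step, TOTAL_STEPS, PEAK_LR))
```

The dataset under evaluation is mixed in only during this decay window, which is why a modest token budget is enough to compare candidates.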

English web data

•

FineWeb-Edu: classifier trained on Llama3-70B-Instruct annotations

•

DCLM: fastText-based classifier (high-scoring subreddit posts)

⇒ FineWeb-Edu filtering helps on educational benchmarks while DCLM helps on conversational ones → filter with both and mix them in a well-chosen ratio

Math data

Dataset (unique tokens × epochs = tokens trained)	GSM8K	MATH
InfiMM-WebMath (40B × 1.5 epochs = 60B)	14%	1%
OWM (12B × 5 epochs = 60B)	10%	2.3%

⇒ Even after training on 60B tokens (with repetition), scores remain far below those of SOTA LMs

•

insufficient dataset sizes

•

insufficient focus on step-by-step mathematical reasoning

⇒ FINEMATH

•

a collection of up to 54B tokens of math data focusing on mathematical deduction and reasoning through classifier-based filtering.

•

Construction

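Extract text from Common Crawl WARC files with Resiliparse (the 5.8B unique URLs in the FineWeb dataset) → score the content with Llama-3.1-70B-Instruct on a 3-point scale (1, 2, 3; 3: step-by-step problem solutions at an appropriate level) → identify domains with at least 10 pages scoring 2 or higher → also include domains with at least 10 URLs in OWM or InfiMM-WebMath

⇒ yields 7.1B pages, 6.5T tokens (InfiMM goes through the same pipeline)

→ generate annotations with Llama-3.1-70B-Instruct and train a fine-grained classifier → second-stage filtering → MinHash LSH deduplication → fastText language detection (keep English)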

⇒ FineMath4+ : 10B

⇒ FineMath3+ : 34B

⇒ Infi-WebMath4+ : 8.5B

⇒ Infi-WebMath3+ : 20.5B
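The domain-selection step of the construction pipeline above can be sketched as follows. This is an illustrative sketch only: the `(url, score)` input format, the function name, and the example URLs are assumptions; only the two thresholds (score of 2 or higher, at least 10 pages per domain) come from the notes.

```python
from collections import Counter
from urllib.parse import urlparse


def select_domains(scored_pages, min_score=2, min_pages=10):
    """Return domains that have at least `min_pages` pages scoring >= `min_score`.

    scored_pages: iterable of (url, score) pairs, with score on the 1-3 scale.
    """
    counts = Counter(
        urlparse(url).netloc
        for url, score in scored_pages
        if score >= min_score
    )
    return {domain for domain, n in counts.items() if n >= min_pages}


# Toy input: one math-heavy domain, one with too few good pages, one off-topic.
pages = [(f"https://mathsite.example/p{i}", 3) for i in range(12)]
pages += [(f"https://blog.example/p{i}", 2) for i in range(4)]
pages += [(f"https://news.example/p{i}", 1) for i in range(50)]

print(select_domains(pages))  # {'mathsite.example'}
```

Selecting at the domain level first keeps the expensive LLM scoring to a sample of pages; the fine-grained classifier then handles page-level filtering.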

[Results]

•

Annealing on data filtered for high quality improves performance.

•

Infi-WebMath plateaus while FineMath does not, possibly because FineMath is drawn from the broader corpus.

Code data

•

Beyond direct applications of LLM coding abilities, the benefit of code data (improves natural language reasoning and world knowledge) is already well established

•

Stack v1, ~3TB → filtered → StarCoderData (250 billion tokens)

•

Stack v2, ~32TB → filtered → StarCoder2Data (900 billion tokens)

•

The paper filters StarCoder2Data once more

◦

To fit the limited capacity of a small model as well as possible (selected the 15 largest programming languages from StarCoder2Data)

◦

Generate annotations with Llama3-70B-Instruct and train 15 language-specific classifiers based on the StarEncoder model (educational score rated 0-5)
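The classifier-based code filtering above can be sketched as a per-language threshold filter. This is a sketch under stated assumptions: the threshold of 3.0, the function names, and the toy scorer are hypothetical stand-ins; the real classifiers are the StarEncoder-based models trained on LLM annotations.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Scorer = Callable[[str], float]  # returns an educational score in [0, 5]


def filter_by_edu_score(
    files: Iterable[Tuple[str, str]],
    scorers: Dict[str, Scorer],
    threshold: float = 3.0,  # assumed cutoff, not stated in these notes
) -> List[Tuple[str, str]]:
    """Keep (language, source) pairs whose language-specific score >= threshold."""
    kept = []
    for lang, source in files:
        scorer = scorers.get(lang)
        if scorer is None:
            continue  # language is not among the selected ones
        if scorer(source) >= threshold:
            kept.append((lang, source))
    return kept


def toy_scorer(source: str) -> float:
    """Stand-in for a trained classifier: commented code scores high."""
    return 4.0 if "#" in source else 1.0


files = [
    ("python", "# add two numbers\ndef add(a, b):\n    return a + b\n"),
    ("python", "x=1\ny=2\n"),
    ("cobol", "DISPLAY 'HELLO'."),
]
kept = filter_by_edu_score(files, {"python": toy_scorer})
print(len(kept))  # 1
```

Using one scorer per language mirrors the 15 language-specific classifiers: what counts as educational code differs between, say, Python and SQL.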

[Results]

•

Filtering code data and then annealing on it also improves performance.

(⇒ So even data filtered by a classifier trained on LM-based annotations yields gains for a small LLM?)

4. Pretraining

•

The smaller the model, the longer the training duration needs to be.

◦

Qwen2-1.5B → 7 trillion tokens

◦

Qwen2.5-1.5B → 18 trillion tokens

[Rules]

•

SmolLM2 → 11 trillion tokens (about 2 epochs on the collected datasets)

(no pretraining with a fixed dataset mixture)

Performance-driven intervention

Continuously monitor evaluation metrics on key benchmarks → adjust the training data accordingly

Upsampling high-quality math and code during the annealing phase

Use datasets such as FineMath and Stack-Edu in the final stage

Strategic introduction of medium-sized datasets

Introduce medium-sized datasets such as OWM, InfiMM-WebMath, and Stack-Edu partway through training (to keep them from being diluted by CC data early on)

Avoiding excessive data repetition

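For most datasets, stay close to the recommended repetition threshold of 4-5 epochs (some repetition is fine, but not too much)

⇒ The data mixture appears to have been decided reactively as training progressed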
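The repetition rule can be checked with simple arithmetic: epochs seen = tokens allocated / unique tokens in the dataset. The sketch below uses the dataset sizes quoted earlier in these notes (in billions of tokens); the uniform 60B allocation and the function names are hypothetical, chosen only to illustrate the check.

```python
def epochs_seen(tokens_allocated_b: float, dataset_size_b: float) -> float:
    """Number of passes over a dataset given an allocated token budget."""
    if dataset_size_b <= 0:
        raise ValueError("dataset_size_b must be positive")
    return tokens_allocated_b / dataset_size_b


def over_repeated(allocations_b, sizes_b, max_epochs=5.0):
    """Return {dataset: epochs} for datasets repeated more than max_epochs times."""
    flagged = {}
    for name, allocated in allocations_b.items():
        epochs = epochs_seen(allocated, sizes_b[name])
        if epochs > max_epochs:
            flagged[name] = epochs
    return flagged


# Unique-token sizes from the notes above (billions of tokens).
sizes_b = {"InfiMM-WebMath": 40.0, "OWM": 12.0, "FineMath4+": 10.0}
# Hypothetical plan: give every dataset the same 60B-token budget.
allocations_b = {name: 60.0 for name in sizes_b}

print(over_repeated(allocations_b, sizes_b))  # {'FineMath4+': 6.0}
```

With this budget OWM sits exactly at 5 epochs and InfiMM-WebMath at 1.5, matching the math ablation table, while the smaller FineMath4+ would exceed the threshold.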

[Data Mixture & Training Stages]

(0T to 6T tokens) 0.9 English Web (60% FineWeb-Edu + 40% DCLM) + 0.1 StarCoderData

(math is excluded because there is too little of it)

⇒ Knowledge/reasoning performance improves as expected

(6T to 8T tokens) 0.75 English Web (60% FineWeb-Edu + 40% DCLM) + 0.2 StarCoderData + 0.05 OWM

•

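Address the weaknesses observed in coding and mathematical reasoning while preserving knowledge

⇒ Code performance clearly increased, but the gain in math was marginal

⇒ MMLU (MCF) scores above random guessing even though training had only reached stage 2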

⇒ Increasing the share of DCLM relative to FineWeb-Edu helps MMLU (MCF) more

⇒ Additionally use conversational + math datasets

(8T to 10T tokens) 0.74 English Web (40% FineWeb-Edu + 60% DCLM) + 0.16 Code (Stack-Edu(Main)+StarCoder2Data) + 0.1 Math (text-only English portion of InfiMM-WebMath with OWM)

•

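For code, the share of high-quality data was increased; for math, the absolute amount was increased as well.

⇒ Many loss spikes reportedly occurred in this stage… (rewinding was tried, but the cause was never identified and the issue eventually resolved itself)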

(10T to 11T tokens) 0.58 English Web (40% FineWeb-Edu + 60% DCLM) + 0.24 Code (more Stack-Edu <higher Python>) + 0.14 Math (InfiWebMath-3+, FineMath 4+, OWM, AugGSM8K) + 0.04 Cosmopedia v2

⇒ Performance improves on code and math
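Two of the stage mixtures above (0T to 6T and 8T to 10T) can be written down as sampling weights, checked to sum to 1, and turned into per-source token budgets. This is a sketch: the dictionary layout and function names are assumptions, and the weights are simply the fractions quoted in these notes with the English-web share split by the stated FineWeb-Edu/DCLM ratio.

```python
import random

# Weights per source; the English-web share is split by the stated ratio.
STAGE_MIXTURES = {
    "0T-6T": {
        "FineWeb-Edu": 0.9 * 0.6,
        "DCLM": 0.9 * 0.4,
        "StarCoderData": 0.1,
    },
    "8T-10T": {
        "FineWeb-Edu": 0.74 * 0.4,
        "DCLM": 0.74 * 0.6,
        "Code": 0.16,
        "Math": 0.10,
    },
}


def validate(mixture, tol=1e-9):
    """Raise if the mixture weights do not sum to 1."""
    total = sum(mixture.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"mixture weights sum to {total}, expected 1.0")


def token_budget(mixture, stage_tokens_t):
    """Tokens (in trillions) drawn from each source over a stage."""
    validate(mixture)
    return {name: weight * stage_tokens_t for name, weight in mixture.items()}


def sample_source(mixture, rng):
    """Pick the source of the next training document according to the weights."""
    names = list(mixture)
    return rng.choices(names, weights=[mixture[n] for n in names], k=1)[0]


for source, tokens in token_budget(STAGE_MIXTURES["0T-6T"], 6.0).items():
    print(f"{source}: {tokens:.2f}T tokens")

rng = random.Random(0)
print(sample_source(STAGE_MIXTURES["8T-10T"], rng))
```

A validation step like `validate` is cheap insurance when mixtures are edited by hand between stages, since a typo in one weight silently changes every other source's effective share.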

[Context Length extension]

•

Before the final 75 billion tokens of stage 4, the context length is extended from 2K to 8K

(40% long-context documents (8k tokens or more) sourced from DCLM (10%), FineWeb-Edu (10%), and the books subset of Dolma (20%))

⇒ final SmolLM2 base model

[Base model evaluation]

•

Worse than Qwen on math and code, but better on general knowledge. (What is Qwen doing differently?)

[Post-training - Instruction Tuning]

•

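SFT on open-source instruction data alone does not beat existing open-source instruct LMs, so the SFT data was built in-house

◦

Carefully combine existing curated datasets with newly developed synthetic datasets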

▪

Magpie-Ultra: conversational dataset

→ generate a three-turn dataset with Llama-3.1-405B-Instruct-FP8

→ filter with Llama-3.1-8B-Instruct and Llama-Guard-3-8B

▪

Special-purpose datasets:

•

Smol-Constraint

→ improves the ability to follow instructions with detailed constraints

→ generate 550,000 Q,A pairs with Qwen2.5-72B-Instruct → filter → 36K

•

Smol-Rewrite

→ synthesize a diverse collection of emails, tweets, LinkedIn posts, and notes using PersonaHub (Ge et al., 2024) and personas from the FinePersonas dataset

→ prompted Qwen2.5-72B-Instruct to rewrite the given texts

•

Smol-Summarization

→ synthesize a diverse collection of emails, tweets, LinkedIn posts, and notes using PersonaHub and personas from the FinePersonas dataset

→ prompted Qwen2.5-72B-Instruct to summarize the given texts

•

MATH DATA

→ NuminaMath-CoT: shows strong performance on MATH and MT-Bench

→ MetaMathQA: improves results on GSM8K (also included in OpenHermes2.5)

+ Self-OSS-Starcoder2Instruct 50K

+ SystemChats2.0 80K

+ LongAlign 3.7K

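+ OpenHermes2.5 100K ⇒ strengthens knowledge and everyday conversation ability

+ Everyday-Conversations (2,200 casual multi-turn interactions) ⇒ strengthens knowledge and everyday conversation ability

+ ExploreInstruct ⇒ strengthens knowledge and everyday conversation ability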

statistics

[Post-training - Alignment]

•

Of the preference datasets tried, UltraFeedback worked best

•

We trained for 2 epochs with a learning rate of 1.0 × 10^-6, beta of 0.5, global batch size of 128, and sequence length of 1024 tokens.

⇒ Short-context DPO does not affect long-context ability

•

Despite training on fewer tokens than Qwen2.5-1.5B (18 trillion tokens), the final instruct LM achieves comparable performance on (1) instruction following and (2) math and code
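To make the role of the beta of 0.5 quoted above concrete, here is a sketch of the standard DPO objective for a single preference pair, computed from summed sequence log-probabilities. The function and argument names are hypothetical; this is the textbook loss, not code from the paper.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.5,  # value quoted in the notes above
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

A larger beta penalizes deviation from the reference model's preferences more sharply, so the same log-ratio gap moves the loss further.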

5. Conclusion

•

SmolLM2 135M and 360M were also trained, and their training process is presented as well

•

Releases the dataset curation and pre-training methodology for training open small LMs

•

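Placing high-quality data late in training and filtering aggressively are, unsurprisingly, key factors.

◦

It is encouraging that a classifier trained on LM annotations is enough to build a reasonably good filter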

•

To train a small model successfully, keep only what is worth keeping (e.g., selecting 15 languages for code) and train on that data for as many tokens as possible