Qwen3 Technical Report & Group Sequence Policy Optimization (GSPO)

0. Abstract

•

The many Qwen3 variants on Hugging Face (the "Qwen3 Kingdom")

◦

Qwen3-2504: the earliest Qwen3 release. Offers reasoning and non-reasoning modes in a single hybrid model.

◦

Qwen3-2507: ships separately trained instruct and reasoning models. GSPO appears to have been applied starting with this version.

◦

Qwen-Next: supports hybrid attention, multi-token prediction, and related techniques to scale up total parameters and context length.

◦

Qwen-Plus/Max: stronger Chinese/English ability plus more model parameters (1 trillion parameters, pretrained on 36 trillion tokens).

→ Max seems to be offered as a service only, not released publicly

•

This review covers the technical report written for the Qwen3-2504 release, plus the GSPO paper

Technical Report

1. Introduction

•

Qwen3 differs from earlier versions in the following ways.

- a single model that can be used in both thinking mode and non-thinking mode

- a controllable thinking token budget (users can tune the performance-cost trade-off as they like)

- pre-trained on 36 trillion tokens / covers 119 languages and dialects

- multi-stage post-training that integrates the reasoning modes and strengthens instruction following (IF)

2. Architecture

•

Model Card

◦

Dense (6): Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen3-32B

◦

MoE (2): Qwen3-30B-A3B, Qwen3-235B-A22B

•

Architecture

◦

The architecture is nearly identical to Qwen2.5

◦

Grouped Query Attention, SwiGLU, Rotary Positional Embeddings, RMSNorm with pre-normalization.

◦

QK-Norm replaces the QKV-bias used in Qwen2

(QK-Norm: one of the three modifications proposed in Scaling Vision Transformers to 22 Billion Parameters to improve ViT efficiency and training stability)

- After xxxx steps the ViT training loss diverged; on inspection, attention scores in some layers had converged to one-hot vectors, i.e. near-zero attention entropy

- Applying layer norm to Q and K rescales the attention logits and acts as a calibration (as the name says, Qwen3 normalizes Q and K; V is not normalized)
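The calibration effect of QK-Norm can be illustrated with a small NumPy sketch. This is not the Qwen3 or ViT-22B implementation: the un-gained RMS normalization, the tensor shapes, and the artificial scale factor of 30 are all assumptions chosen only to make attention-entropy collapse visible.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize the last axis to unit root-mean-square (no learned gain in this sketch).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention_probs(q, k, use_qk_norm):
    # q: (num_queries, d), k: (num_keys, d)
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def mean_entropy(p):
    # Average Shannon entropy (nats) over attention rows.
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q = 30.0 * rng.standard_normal((4, 16))  # deliberately large activations (assumed)
k = 30.0 * rng.standard_normal((8, 16))

entropy_raw = mean_entropy(attention_probs(q, k, use_qk_norm=False))
entropy_normed = mean_entropy(attention_probs(q, k, use_qk_norm=True))
print(entropy_raw, entropy_normed)
```

With large activations the un-normalized logits push softmax toward one-hot rows, while the normalized version keeps the logits in a bounded range, which is the behavior the note above describes.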

◦

The MoE architecture is not described in detail, but the shared experts used in Qwen2.5-MoE were reportedly removed; this contributed to better downstream task performance

▪

128 total experts with 8 activated experts per token

- shared expert: a technique used in DeepSeekMoE and others, where an expert FFN() stays active for every token in addition to the top-k routed FFN()s

(in the accompanying figure, FFN 1 always contributes to the output hidden state)

◦

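A Load Balancing Loss (LBL) is added to stabilize MoE training; this also contributed to better downstream task performance

- Trained naively, an MoE suffers from expert collapse, where tokens are routed disproportionately to a few experts

- LBL: a loss that penalizes excessive concentration on particular experts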

▪

$f_i$: fraction of tokens received by expert $E_i$

▪

$P_i$: routing probability assigned to expert $E_i$

▪

$N_E$: total number of experts

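- Because model and pipeline parallelism are unavoidable, LBL is sometimes computed only inside each parallel group and then averaged via all-gather

▪

$N_P$: total number of parallel groups

▪

$\bar{f} = \frac{1}{N_E}$: the ideal value under a uniform split

▪

$\bar{P} = \frac{1}{N_E}$

⇒ Intuitively, the loss pushes the actual token assignments to roughly follow the probability distribution the router promises; real implementations add normalization and scaling so that collapse shows up in the loss value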

▪

Uniform split: $f = P = [0.5, 0.5]$

→ (1/2) × (0.5/0.5 × 0.5/0.5 + 0.5/0.5 × 0.5/0.5) = 1.0

▪

Collapse: $f = [0.9, 0.1]$, $P = [0.9, 0.1]$ (the router probabilities follow the skewed assignments)

→ (1/2) × (0.9/0.5 × 0.9/0.5 + 0.1/0.5 × 0.1/0.5) = 1.64 (if $P$ stayed at $[0.5, 0.5]$, the value would remain 1.0 for any $f$)

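⇒ Even if one parallel group happens to specialize in a particular domain's data, the formula still makes the router aim for a "uniform split" within that group → this works against experts specializing in specific domains during training → the aim is to learn token-level routing patterns rather than domain-specific experts (the end goal is efficiency). Note that the report itself describes its global-batch LBL as encouraging expert specialization.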
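The load-balancing idea above can be sketched in a few lines, assuming the Switch-Transformer-style form $\mathrm{LBL} = N_E \sum_i f_i P_i$, which equals $\frac{1}{N_E}\sum_i \frac{f_i}{\bar{f}}\frac{P_i}{\bar{P}}$ when $\bar{f} = \bar{P} = 1/N_E$. The function name `load_balancing_loss` and the toy router outputs are illustrative assumptions, not the Qwen3 implementation.

```python
import numpy as np

def load_balancing_loss(router_probs, top_k=1):
    """router_probs: (num_tokens, num_experts) softmax outputs of the router."""
    num_tokens, num_experts = router_probs.shape
    # Experts actually selected for each token (top-k routing).
    selected = np.argsort(-router_probs, axis=1)[:, :top_k]
    counts = np.bincount(selected.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)   # f_i: fraction of routed tokens
    P = router_probs.mean(axis=0)       # P_i: mean routing probability
    return num_experts * float(np.sum(f * P)), f, P

# Balanced router: half the tokens prefer expert 0, half prefer expert 1.
balanced_probs = np.array([[0.6, 0.4], [0.4, 0.6]] * 50)
# Collapsed router: every token prefers expert 0.
collapsed_probs = np.array([[0.9, 0.1]] * 100)

balanced_loss, _, _ = load_balancing_loss(balanced_probs)
collapsed_loss, _, _ = load_balancing_loss(collapsed_probs)
print(balanced_loss, collapsed_loss)
```

The loss sits at its minimum of 1.0 for the balanced router and grows once both the assignments and the router probabilities concentrate on one expert.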

◦

Uses a BBPE (byte-level BPE) tokenizer

▪

vocabulary size: 151,669

3. Pre-training

Pre-training Data

•

Compared with Qwen2.5: at least 2× the pre-training tokens and more than 3× the supported languages

◦

pre-training tokens: 36T

◦

languages: 119

•

A Qwen2.5-VL (extraction) → Qwen2.5 (rewrite) pipeline over PDF-like documents yields trillions of tokens of data

•

Data-mixture optimization was reportedly done by testing many data mixtures on small proxy models

Pre-training Stage (1→2→3)

General Stage: trains on 30T pre-training tokens (5/6 of the total) at a 4,096 context length

Reasoning Stage: trains on 5T high-quality tokens with a larger share of STEM, coding, reasoning, and synthetic data (context length stays at 4,096)

Long Context PT: trains on under 1T tokens, mixed as 75% sequences of 16,384-32,768 tokens and 25% sequences of 4,096-16,384 tokens

(I have not looked at these closely, but the following techniques were reportedly applied)

- Adjusted Base Frequency (ABF): raises the RoPE base frequency (10,000→1,000,000) so that the relative phase (rotation angle) extrapolates less on longer sequences

- YaRN: (1) splits the RoPE dimensions into groups and applies interpolation or scaling to each (2) introduces a temperature $t$ that scales the attention logits before softmax, so that tokens far away are not ignored entirely by attention

- Dual Chunk Attention: combines three kinds of attention (1) Intra-chunk attention: between tokens inside the same chunk (2) Successive-chunk attention: captures token relations between adjacent chunks (3) Inter-chunk attention: also captures some relations between non-adjacent (distant) chunks
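To make the base-frequency change concrete, the sketch below counts how many rotary pairs complete less than one full rotation within a 32,768-token context. The head dimension of 128 and the helper names are assumptions for illustration; the standard RoPE frequency $\text{base}^{-2i/d}$ is used.

```python
import math

def rope_wavelengths(base, head_dim=128):
    # Wavelength in tokens of each rotary pair: 2*pi * base**(2i/d).
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def pairs_not_wrapping(base, context_len, head_dim=128):
    # Rotary pairs that rotate less than one full turn across the whole context.
    return sum(w > context_len for w in rope_wavelengths(base, head_dim))

pairs_base_10k = pairs_not_wrapping(10_000, 32_768)
pairs_base_1m = pairs_not_wrapping(1_000_000, 32_768)
print(pairs_base_10k, pairs_base_1m)
```

A larger base stretches the wavelengths, so more dimensions stay within a single rotation over the long context, which is one way to read "extrapolates less".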

Pre-training Evaluation

•

Focus on general knowledge, reasoning, math, science, coding, and multilingual benchmarks

- Qwen3-235B-A22B-Base scores higher than the previous SOTA dense and MoE base models (DeepSeek-V3 Base, Llama-4-Maverick Base, and Qwen2.5-72B-Base) with fewer total or active parameters

- With the same pre-training tokens, Qwen3-30B-A3B is on par with Qwen3-14B using about 1/5 of the active parameters

- Qwen3-1.7B/4B/8B/14B/32B-Base, the counterparts of the earlier Qwen2.5-3B/7B/14B/32B/72B-Base, perform well with fewer total parameters (unsurprising after training on more than 2× the tokens)

4. Post-training

Strategic goals of Qwen3 post-training:

Thinking Control: combining think mode and non-think mode

Strong-to-Weak Distillation: effective and efficient training of small-scale models

(compared with training the student model through all four stages: (1) higher pass@1/64 scores (2) 1/10 of the GPU hours)

Long-CoT Cold Start (SFT)

- Stage for instilling CoT ability in the model. The dataset is refined through query/answer filtering → SFT

Query Filtering (Qwen2.5-72B-Instruct)

•

Remove queries that are not easily verifiable, contain multiple sub-questions, or ask for general text generation

•

Remove queries that Qwen2.5-72B-Instruct can answer without CoT

•

domain labeling (Qwen2.5-72B-Instruct) → dataset curation

(the taxonomy used is not disclosed)

Answer Filtering (QwQ-32B)

•

Generate N responses per query, then filter the (q, a) pairs for which pass@N succeeds

•

Filtered out: repetition / guesswork without adequate reasoning / inconsistencies between thinking and final answer / language mixing / responses overly similar to validation-set items (presumably checked against sets that have valid answers?)

→ (speculation) this step was probably done with a US foundation model

- The final model is evaluated on RULER (retrieval generation)

- In thinking mode (budget=8192), performance drops slightly as the context gets longer (Qwen3-32B at 64K/128K; non-think vs. think)

Reasoning RL (GRPO)

•

Builds a query-verifier (answer/test-case/etc.) set that was not used in the SFT stage, is difficult (model-based filter), and can be used directly for RL from the cold-start model

•

GRPO is run on 3,995 samples

•

A large batch size, a high number of rollouts per query, and off-policy training helped training efficiency.

(off-policy training here seems to mean that the policy checkpoint used for rollouts was not swapped at every training step)

•

Qwen3-235B-A22B goes from 70.1 → 85.1 on AIME24 after 170 RL training steps

Thinking Mode Fusion (SFT with and without thinking mode)

SFT data construction

•

Thinking Data: queries from the Long-CoT Cold Start → answers generated by the Reasoning RL model → rejection sampling

•

Non-Thinking Data: curated datasets for coding, mathematics, instruction-following, multilingual tasks, creative writing, question answering, and role-play / answer verification is presumably handled well (few details given) / share of translation tasks increased to improve low-resource language capabilities

Chat Template Design

•

Thinking (reasoning) mode is the default

•

For multi-turn dialogs, the dataset mixes /think and /no_think flags

•

Controllable through the HF chat template

Thinking Budget (the explanation in the report is unclear)

•

When the thinking budget is reached, the string “Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n” is inserted to end thinking and start the answer (a few such examples were probably added to the dataset)

•

The model was reportedly observed to generate the final answer from the reasoning steps produced before “Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n”
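The budget mechanism described above can be sketched as a simple truncation step. The function `apply_thinking_budget` and the token-list representation are hypothetical; only the stop string comes from the notes, and a real decoder would apply this inside the generation loop.

```python
STOP_THINKING = (
    "Considering the limited time by the user, I have to give the solution "
    "based on the thinking directly now.\n</think>.\n\n"
)

def apply_thinking_budget(thinking_tokens, budget):
    """Return (kept_tokens, injected_suffix) for a list of thinking tokens.

    If thinking fits within the budget nothing is injected; otherwise the
    thinking is cut at `budget` tokens and the stop string is appended so the
    model continues with the final answer.
    """
    if len(thinking_tokens) <= budget:
        return thinking_tokens, ""
    return thinking_tokens[:budget], STOP_THINKING

tokens = ["step"] * 10  # stand-in for 10 generated thinking tokens
kept, suffix = apply_thinking_budget(tokens, budget=4)
short_kept, short_suffix = apply_thinking_budget(tokens, budget=20)
```

The answer is then generated conditioned on `kept` plus `suffix`, which matches the observation that the model answers from the partial reasoning.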

General RL (RLHF / preference alignment)

- A reward system is built over more than 20 tasks

(apparently a mix of LLM-as-Judge, Reward Model, Verifier Reward, and so on) * for LLM-as-Judge, Qwen2.5-72B-Instruct is reportedly given the reference answer and scores the model output * API models were almost certainly used as judges as well

•

Instruction Following: reward signals for content, format, length, use of structured output, and human alignment

•

Format Following: whether the output matches the /think or /no_think flag, and whether the <think>, </think> tags are generated correctly

•

Preference Alignment: RLHF (helpfulness, engagement, style)

•

Agent Ability: whether tools are called correctly through the designated interfaces; reward signals come from multi-turn training with real environment execution

•

Abilities for Specialized Scenarios: a reward system for RAG tasks was also built and used for training

Strong-to-Weak Distillation

- student model: Qwen3-0.6B, 1.7B, 4B, 8B, 14B, and Qwen3-30B-A3B

- teacher model: Qwen3-32B and Qwen3-235B-A22B

•

Off-policy Distillation: the teacher generates /think and /no_think answers → student SFT

•

On-policy Distillation: the student generates /think and /no_think answers and its logits are aligned with the teacher's → KLD training
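The KLD step of on-policy distillation can be sketched as a per-token KL divergence between teacher and student distributions. The direction KL(teacher || student), the function names, and the toy logits are assumptions; the notes only say that KLD is minimized.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    # Mean per-token KL(teacher || student) over (num_tokens, vocab) logits.
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean())

# Toy logits for 2 token positions over a 3-word vocabulary (assumed values).
teacher = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
student = np.array([[1.0, 1.0, 0.0], [0.5, 0.5, 0.5]])

kl_mismatch = distill_kl(teacher, student)
kl_same = distill_kl(teacher, teacher)
print(kl_mismatch, kl_same)
```

In the on-policy setting the token positions would come from sequences sampled by the student itself, with the teacher only scoring them.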

Post-Training Evaluation

- thinking mode: temperature 0.6 / top-p 0.95 / top-k 20 (presence penalty 1.5 for the creative writing benchmarks)

- non-thinking mode: temperature 0.7 / top-p 0.8 / top-k 20 / presence penalty 1.5

Qwen3-235B-A22B

Qwen3-32B

- Qwen3-235B-A22B scores higher than DeepSeek-R1 and DeepSeek-V3 in thinking and non-thinking mode respectively * with 60% of DeepSeek-R1's activated params and 35% of its total params, it wins on 17/23 benchmarks * better non-thinking performance than gpt-4o-2024-07-18 on 18/23 benchmarks

- Despite being a hybrid model, Qwen3-32B has better reasoning performance than QwQ-32B (17/23)

Lightweight Model

- The thinking performance of Qwen3-30B-A3B and Qwen3-14B is higher than that of QwQ-32B, which has more parameters

- Better than the student baselines (DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-32B); possibly an expected result given more than 2× the pre-training tokens and the added KLD training!

- Same result for Qwen3-8B / 4B / 1.7B / 0.6B

Discussion

Effectiveness of Thinking Budget

- On math, code, and STEM, performance rises as the thinking budget grows (further gains are expected if the budget goes beyond 32K)

Effectiveness and Efficiency of On-Policy Distillation * experiments start from an 8B checkpoint that has finished off-policy distillation

- For small models, distillation gives higher performance than direct RL and is 10× more efficient in GPU hours

- On AIME24/25, pass@64 is higher with distillation than with RL; the paper argues that distillation instills a more effective exploration ability

Effects of Thinking Mode Fusion and General RL * checkpoints from each stage are measured on public/in-house benchmarks to assess thinking mode fusion and General RL * (1) CounterFactQA (hallucination) (2) LengthCtrl (length control & generation) (3) ThinkFollow (/think, /no_think usage in multi-turn) (4) ToolUse (intent recognition, format accuracy, parameter accuracy in single/multi-turn tool calling)

- After thinking mode fusion: 88.7% thinking-switch ability, and general IF performance also improves

- After general RL, thinking and non-thinking ability rise together (the latter is somewhat surprising)

- For Knowledge, STEM, Math, and Coding, thinking performance is flat or slightly lower after stages 3 and 4

(verifiable rewards seem to matter most for learning the relevant knowledge / this looks like an unavoidable trade-off for model versatility)

6. Future Work

- scaling pre-training data (quality & diversity)

- efficient compression/generation → MTP

- environmental feedback for agent system

Group Sequence Policy Optimization (GSPO)

1. Introduction

•

Successful RL training requires stable and robust training dynamics, but GRPO reportedly becomes unstable as model size scales up

•

The paper points out the problems of GRPO and proposes GSPO as the fix

2. Preliminaries

•

Likelihood of $y$ conditioned on $x$: $\pi_\theta(y|x) = \prod_{t=1}^{|y|} \pi_\theta(y_t|x, y_{<t})$

•

Verified reward $r$: $r(x,y) \in [0,1]$

Proximal Policy Optimization (PPO)

•

importance ratio of $y_t$: $w_t(\theta) = \frac{\pi_\theta(y_t|x,y_{<t})}{\pi_{\theta_{old}}(y_t|x,y_{<t})}$

•

Restricts the policy update to a proximal region around the old policy

•

The value model acts as a normalizer and is indispensable for PPO training (it needs to be the same size as the policy model)

Group Relative Policy Optimization (GRPO)

•

Computes a relative advantage for each response, which removes the need for a value model

•

All tokens in $y_i$ share $\hat{A}_i$
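The two GRPO ingredients above (group-relative advantage, token-level clipped ratio) can be sketched as follows. The standardization $\hat{A}_i = (r_i - \text{mean}) / \text{std}$ follows the usual GRPO formulation; the function names, the `eps` stabilizer, and the toy rewards are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # A_i = (r_i - mean(r)) / std(r), computed inside one group of G responses.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    # Token-level clipped surrogate for one response; every token shares `advantage`.
    w = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    clipped = np.clip(w, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.minimum(w * advantage, clipped * advantage).mean())

# Four sampled responses to one query with binary verifier rewards (assumed).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# With an unchanged policy every token ratio is 1, so the objective equals the advantage.
on_policy_value = grpo_token_objective([-1.0, -2.0], [-1.0, -2.0], advantage=1.0)
print(advantages, on_policy_value)
```

No value model appears anywhere: the baseline is simply the mean reward of the group.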

3. Motivation

•

large model size / sparsity (MoE) / long responses → large rollout batch (to maximize hardware utilization)

•

Importance sampling

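We need the expectation of $f$ under the target distribution $\pi_{tar}$ but cannot sample from $\pi_{tar}$, so that expectation is approximated with samples from the behavior distribution $\pi_{beh}$, reweighted by $\pi_{tar}/\pi_{beh}$.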

- With a large batch, the rollout is split into mini-batches for the updates, and the same $\pi_{beh}$ is applied to every mini-batch → off-policy discrepancy → this is why PPO/GRPO use clipping (to keep $\pi_{tar}$ and $\pi_{beh}$ from drifting too far apart)

•

Optimization-Reward Unit Mismatch

•

The importance sampling formula shows that the key is to match the instance unit (for an LLM, the sequence)

•

GRPO computes the importance weight per token ($w_{i,t}(\theta)$) → the reward is still per sequence ($\hat{A}_{i,t}=\hat{A}_i$), so the units do not match → applying the weight separately at each token feeds a single token's sampling noise straight into the gradient → the longer the sequence, the more per-token ratio noise accumulates → unnecessary high variance

•

GRPO's clipping mechanism makes this worse → in the experiments, once training collapses it cannot be recovered

One-line summary: optimization (clipping) is applied per token, while the reward is applied per sequence
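The importance-sampling identity behind this section, $\mathbb{E}_{z\sim\pi_{tar}}[f(z)] = \mathbb{E}_{z\sim\pi_{beh}}\left[\frac{\pi_{tar}(z)}{\pi_{beh}(z)} f(z)\right]$, can be checked numerically. The Gaussian choices for $\pi_{tar}$ and $\pi_{beh}$ and the test function $f(z) = z$ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)   # samples from pi_beh = N(0, 1)

# Target pi_tar = N(1, 1); the density ratio pi_tar(z) / pi_beh(z) is exp(z - 0.5).
weights = np.exp(z - 0.5)

f = z                              # f(z) = z, so the true E_tar[f] is 1
estimate = float(np.mean(weights * f))
mean_weight = float(weights.mean())
print(estimate, mean_weight)
```

The correction only works because the weight is attached to the same unit that was sampled. In the LLM setting the sampled unit is a whole response, which is the notes' argument for a sequence-level ratio.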

4. Algorithm

•

Since the reward is given at the sequence level, the importance weight should also be at the sequence level ($\frac{\pi_\theta(y_{i,t}|x,y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t}|x,y_{i,<t})} \Rightarrow \frac{\pi_\theta(y|x)}{\pi_{\theta_{old}}(y|x)}$)

GSPO

•

Uses the sequence likelihood ratio, normalized by length

•

Clipping is also applied per sequence rather than per token → keeps overly off-policy responses from influencing the gradient
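A minimal sketch of the sequence-level ratio and clipping, assuming the GSPO form $s_i(\theta) = \left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{old}}(y_i|x)}\right)^{1/|y_i|}$ computed in log space. The function names, the separate `clip_low` / `clip_high` arguments, and the toy log-probabilities are assumptions for illustration.

```python
import math
import numpy as np

def sequence_ratio(logp_new, logp_old):
    # s_i = exp(mean_t(log pi_new(y_t) - log pi_old(y_t))): length-normalized ratio.
    diff = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return float(np.exp(diff.mean()))

def gspo_objective(group_logp_new, group_logp_old, advantages,
                   clip_low=3e-4, clip_high=4e-4):
    # Sequence-level clipped surrogate, averaged over the G responses of a group.
    terms = []
    for new, old, adv in zip(group_logp_new, group_logp_old, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - clip_low), 1.0 + clip_high)
        terms.append(min(s * adv, s_clipped * adv))
    return sum(terms) / len(terms)

# Two responses of 4 tokens each; each token's log-prob rose by 0.1 (assumed values).
new = [[-0.9] * 4, [-0.9] * 4]
old = [[-1.0] * 4, [-1.0] * 4]
s_example = sequence_ratio(new[0], old[0])
objective = gspo_objective(new, old, advantages=[1.0, -1.0])
print(s_example, objective)
```

Because the ratio is an average of token log-ratios, a whole response is either inside the clip range or clipped as a unit, which is why the clip range is so much narrower than GRPO's.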

gradient analysis; clipping is omitted for brevity

•

The gradient analysis shows that GSPO and GRPO differ in how they weight the gradients of the token log-likelihoods

•

GRPO multiplies each token gradient by its own importance weight (on long sequences, with clipping applied token after token, training becomes unstable), whereas GSPO multiplies every token gradient in a response by the same importance weight
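For reference, the two gradients being compared can be written side by side (clipping omitted, as in the caption above). This is a reconstruction from the definitions of $s_i(\theta)$ and $w_{i,t}(\theta)$ used in these notes, so treat the exact normalization as an assumption.

```latex
% GSPO: every token of response y_i shares the same weight s_i(\theta)
\nabla_\theta J_{\mathrm{GSPO}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} s_i(\theta)\, \hat{A}_i \cdot
      \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]

% GRPO: each token gradient is weighted by its own ratio w_{i,t}(\theta)
\nabla_\theta J_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \cdot
      \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      w_{i,t}(\theta)\,
      \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]
```

The only difference is whether the weight sits outside the sum over $t$ (GSPO) or inside it (GRPO).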

GSPO-Token

•

In scenario-style settings such as multi-turn RL, per-token rewards have to be designed, but neither GRPO nor GSPO is formulated with that in mind

•

Designed so that the value of $s_i(\theta)$ stays unchanged while the gradient still flows to the per-token probabilities.

•

The importance ratio still provides sequence-level stability, while the gradient can be broken down to the token level

Proof

•

When $\hat{A}_{i,t}$ differs across $t$, a different reward can flow into each token gradient (when $\hat{A}_{i,t}=\hat{A}_i$, $\hat{A}_{i,t}$ factors out of $\Sigma_t$, so the formula is identical to GSPO)

•

The sequence-level importance ratio is shared while token-level rewards are still reflected → reduced variance and fine-grained rewards at the same time
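The construction described above can be written out as follows, where $\mathrm{sg}[\cdot]$ denotes stop-gradient (value kept, gradient blocked). This is a sketch of the GSPO-token definition reconstructed from the properties listed in these notes; the clip range $\varepsilon$ is written symmetrically for brevity, which is an assumption.

```latex
% Token-level ratio: numerically equal to s_i(\theta), but differentiable per token
s_{i,t}(\theta) = \mathrm{sg}\left[ s_i(\theta) \right] \cdot
  \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
       {\mathrm{sg}\left[ \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]}

% Objective with per-token advantages \hat{A}_{i,t}
J_{\mathrm{GSPO\text{-}token}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
      \min\left( s_{i,t}(\theta)\, \hat{A}_{i,t},\;
                 \mathrm{clip}\left( s_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon \right) \hat{A}_{i,t} \right) \right]
```

Since the fraction evaluates to 1, $s_{i,t}(\theta)$ has the value $s_i(\theta)$, and its gradient is $s_i(\theta)\,\nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$, which recovers the GSPO gradient when all $\hat{A}_{i,t} = \hat{A}_i$.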

5. Experiments and Discussion

•

Setting

◦

base model: Qwen3-30B-A3B-Base

◦

clipping hyperparameters tuned per method

▪

GSPO: clipping range set to (3e-4, 4e-4)

▪

GRPO: clipping range set to (0.2, 0.27)

◦

rollout data is split into 4 mini-batches for gradient updates

Routing Replay

•

Expert-activation volatility is a serious problem when training MoE models (GRPO amplifies it)

expert-activation volatility of MoE models

For the same input,

- rollout phase: the old policy selects some set of experts to generate the tokens

- update phase: different experts may be activated when the new policy computes the likelihood

(e.g., about 10% of the activated experts change after each RL step (gradient update))

→ with GRPO, which computes the importance ratio at the token level, stability worsens as the model gets deeper

•

Cache the experts activated by the old policy and reuse them when computing the new policy's likelihood

The rollout expert set and the update expert set are kept identical

•

Improves GRPO training stability
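The caching idea can be sketched for a single token and a single MoE layer. Everything here is a toy assumption (the router logits, top-2 routing, and gate weights renormalized over the selected experts); it only shows that replay fixes the expert set while the gate values still come from the new policy.

```python
import numpy as np

def top_k_experts(router_logits, k):
    # Indices of the k highest-scoring experts for one token, sorted by index.
    return np.sort(np.argsort(-router_logits)[:k])

def gate_weights(router_logits, expert_ids):
    # Softmax restricted to the selected experts' logits.
    z = router_logits[expert_ids]
    e = np.exp(z - z.max())
    return e / e.sum()

old_logits = np.array([2.0, 1.0, 0.0, -1.0])  # router of the old (rollout) policy
new_logits = np.array([1.0, 2.5, 0.2, 3.0])   # router after some gradient updates

cached = top_k_experts(old_logits, k=2)       # stored at rollout time
free = top_k_experts(new_logits, k=2)         # what the new policy would pick on its own
replayed_weights = gate_weights(new_logits, cached)  # replay: old expert set, new gates
print(cached, free, replayed_weights)
```

Without replay, the numerator and denominator of a token's importance ratio may be computed through different expert sets, which is the volatility the notes describe.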

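Problems with Routing Replay

•

More memory/communication overhead

•

Limits the capacity inherent to the MoE structure (use of expert diversity)

•

GSPO trains stably without Routing Replay

•

It is not sensitive to expert selection at the individual-token level and relies on the sequence-level likelihood ratio instead → stable training

(whichever expert set is used, the final sequence likelihood does not fluctuate drastically)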

Clipping

•

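GRPO
- clipping fraction: small (ratios mostly fall inside the range)
- variance: large (token-level noise accumulates → noisy gradient)

•

GSPO
- clipping fraction: large (the tails are wide, so much more gets clipped)
- variance: small (sequence-level, so token noise is averaged out → stable gradient)

⇒ "discarding many samples hurts training efficiency" is not always true (the stability of the estimator matters more)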

Benefits of GSPO

•

In practice the training engine (e.g., Megatron) and the inference engine (e.g., vLLM, SGLang) are separate → precision mismatch

- When rollouts come from the inference engine, the likelihood ($\pi_{old}(y|x)$) may be computed slightly differently.

- GRPO needs token-level likelihoods, so it is sensitive even to this small error

- So the likelihood is usually recomputed in the training engine → compute overhead

•

GSPO: uses only the sequence-level likelihood
- the sequence log-likelihood is less sensitive to precision discrepancies than token log-likelihoods
- the likelihood obtained from the inference engine can potentially be used for training as-is

⇒ rollout compute and training compute are decoupled
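Why a length-normalized sequence ratio tolerates engine mismatch better than token ratios can be seen with a toy simulation. The noise model (independent Gaussian error of scale 1e-2 on each token log-probability) is a hypothetical stand-in for the discrepancy between engines, not a measurement.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 512

# Hypothetical per-token log-prob discrepancy between inference and training engines.
noise = rng.normal(0.0, 1e-2, size=num_tokens)

# Worst-case error of a token-level ratio that should have been exactly 1.
token_ratio_error = float(np.max(np.abs(np.exp(noise) - 1.0)))
# Error of the length-normalized sequence ratio built from the same tokens.
sequence_ratio_error = float(abs(np.exp(noise.mean()) - 1.0))
print(token_ratio_error, sequence_ratio_error)
```

The sequence ratio depends only on the mean of the token errors, so it can never be worse than the worst single token and is usually much smaller.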

6. Conclusion

•

Using the sequence likelihood, which follows the basic principle of importance sampling, gives stable training and better performance

Reference

Qwen3 Technical Report (arXiv.org)

Group Sequence Policy Optimization (arXiv.org)

Scaling Vision Transformers to 22 Billion Parameters (arXiv.org)

Qwen2.5 Technical Report (arXiv.org)

DeepSeekMoE: Towards Ultimate Expert Specialization in... (arXiv.org)

Switch Transformers: Scaling to Trillion Parameter Models with... (arXiv.org)