LATENT ACTION PRETRAINING FROM VIDEOS

1. Introduction

• Vision-Language-Action (VLA) models are trained by aligning a vision encoder with a language model.

• Fine-tuning such a VLA requires human teleoperation datasets (remote control, human-labeled), which makes it hard to scale up. Video data can be used to get around this, but it has two limitations:

(1) Video data has no action labels.

(2) Data distribution mismatch: web data does not match the embodiments and environments of robots.

⇒ The paper proposes a method for training a VLA without robot action labels.

⇒ In the results, LAPA clearly outperforms existing methods for learning action policies, especially in cross-environment and cross-embodiment scenarios (that is, in training settings where action labels are scarce).

2. Related Works

VLA

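• A VLA is a model obtained by fine-tuning a VLM on robotic action data so that it can carry out a variety of physical actions.

• There are ongoing efforts to improve performance by adding reasoning paths, conversational-style instruction datasets, and so on, but they all require labeled action data, which limits how far scaling laws can be exploited.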

Training Robot Policies From Videos

• Video is data that carries dynamics information, so in principle it is very well suited to robotics training.

→ The biggest limitation, however, is that the actions are not labeled.

→ Prior work is said to have focused on extracting visual priors from video.

• To learn a specific motion, human motions are retargeted to robot motions and used for fine-tuning.

→ This is hard to generalize.

• Using a small-scale robotic dataset with action labels: (1) train an inverse dynamics model (IDM), an optical flow model, or a reinforcement learning model, (2) use it to label the actionless data, then (3) train the robot policy.

Latent Actions

• Previous work learned a procedure that converts ground-truth actions into latents, in order to capture multimodality and task semantics.

→ LAPA instead maps observations directly to latent actions.

3. LAPA: LATENT ACTION PRETRAINING FOR GENERAL ACTION MODELS

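• LAPA has two stages (strictly speaking, three):

(1) A VQ-VAE based objective learns the relationship between consecutive video frames and discretized latent delta information.

(2) The VLM is trained to predict the latent action given an image and a natural-language instruction.

(3) The model is fine-tuned on a small number of ground-truth action-labeled trajectories.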

3.1 LATENT ACTION QUANTIZATION (VQ-VAE)

• Encoder-decoder structure:

  ◦ enc: given the current frame $x_t$ and the future frame $x_{t+H}$, predicts the latent action $z_t$.

  ◦ dec: given the latent action $z_t$ and the current frame $x_t$, predicts $x_{t+H}$.

→ Attention is applied between the latent action and the current frame.

→ Latent action space: each latent action is a sequence of $s$ tokens from a codebook of size $|C|$.

→ Training uses only two frame images.
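The quantization step described above can be sketched as a nearest-neighbour lookup into the codebook. This is a minimal, dependency-free sketch under stated assumptions, not the paper's implementation: the function name `quantize_latent_action`, the sizes (`s = 4`, `|C| = 8`, `d = 16`) and the random stand-in for the encoder output are all hypothetical, and the real encoder and decoder are neural networks trained end to end.

```python
import random

def quantize_latent_action(features, codebook):
    """Map each continuous feature vector to the index of its nearest code.

    features: list of s vectors (hypothetical encoder output for x_t, x_{t+H})
    codebook: list of |C| code vectors with the same dimensionality
    Returns (indices, quantized): s codebook indices and the s chosen codes.
    """
    def sq_dist(u, v):
        # Squared Euclidean distance between two vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized

# Hypothetical sizes: s = 4 tokens per latent action, |C| = 8 codes, d = 16.
rng = random.Random(0)
s, C, d = 4, 8, 16
codebook = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(C)]
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(s)]

# z_t is the discrete latent action: s indices, each in [0, |C|).
z_t, z_q = quantize_latent_action(features, codebook)
```

The decoder would then receive `z_q` together with $x_t$ and be trained to reconstruct $x_{t+H}$; that part is omitted here.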

3.2 LATENT PRETRAINING

• Label each current frame $x_t$ with a latent action $z_t$, using the encoder from 3.1 as an inverse dynamics model:

  ◦ $P_{\text{inverse-dynamics}}(z_t \mid x_t, x_{t+1})$

• Pretrain a VLM to predict $z_t$ given the language instruction of a video clip and the current image $x_t$.

→ Instead of the LM head, a head over the codebook space is used.

→ The vision encoder is frozen and the LLM is trained.

→ Unlike existing methods, the model does not directly learn a hierarchy or granularity of actions. Instead, the LLM learns the 'delta' that occurs between consecutive frames.

• delta: the change or difference between two consecutive observations (video frames), that is, the amount of change from one state to the next.

• delta end-effector: the change in the position or pose of the end effector from one time step to the next (a value describing how far and in which direction the end of the robot, e.g. of its arm, should move from its current position).
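The latent pretraining step in this section (predicting $z_t$ with a head over the codebook instead of the LM vocabulary) amounts to a classification problem with $|C|$ classes per latent token. The sketch below illustrates that reading with a plain softmax cross-entropy averaged over the $s$ tokens. It is an assumption for illustration only: these notes do not state the exact loss, and `latent_action_loss` and the toy logits are hypothetical.

```python
import math

def latent_action_loss(logits, target_indices):
    """Mean cross-entropy over the s latent-action tokens.

    logits: s lists of |C| scores from a (hypothetical) latent action head
    target_indices: s codebook indices produced by the quantization encoder
    """
    total = 0.0
    for scores, target in zip(logits, target_indices):
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
        total += log_norm - scores[target]  # -log softmax(scores)[target]
    return total / len(target_indices)

# Toy example with |C| = 4 codes and s = 2 tokens.
targets = [0, 2]
uniform = [[0.0] * 4, [0.0] * 4]
confident = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0]]

loss_uniform = latent_action_loss(uniform, targets)      # equals log(4)
loss_confident = latent_action_loss(confident, targets)  # close to 0
```

A head that is undecided over four codes pays `log(4)` per token, while a head that puts most of its mass on the labeled code pays almost nothing.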

3.3 ACTION FINETUNING

• The VLM pretrained in 3.2 is fine-tuned on a dataset with ground-truth actions (delta end-effector).

• As in other work, the target actions are obtained by discretizing the continuous action space for each dimension of the robot.

→ The latent action head is discarded and a new action head is initialized.

→ The vision encoder is frozen and the LLM is trained.
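The per-dimension discretization mentioned above can be illustrated as follows. This sketch assumes equal-frequency (quantile) bin edges fitted on the fine-tuning actions and only 4 bins; these notes do not specify the binning rule or bin count, and the function names and toy action values are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles

def fit_bin_edges(actions, n_bins):
    """Per-dimension interior bin edges from equal-frequency quantiles."""
    dims = list(zip(*actions))  # transpose: one tuple of values per dimension
    return [quantiles(values, n=n_bins) for values in dims]

def discretize(action, edges):
    """Turn one continuous action into one bin index per dimension."""
    return [bisect_right(e, a) for a, e in zip(action, edges)]

# Toy 2-D "delta end-effector" actions (hypothetical values).
actions = [(-0.04, 0.00), (-0.02, 0.01), (-0.01, 0.02), (0.00, 0.03),
           (0.01, 0.04), (0.02, 0.05), (0.03, 0.06), (0.04, 0.07)]

edges = fit_bin_edges(actions, n_bins=4)          # 3 interior edges per dimension
tokens = [discretize(a, edges) for a in actions]  # bin indices in [0, 4)
```

With these toy values each of the four bins receives two of the eight samples in every dimension. Each bin index then serves as a classification target for the newly initialized action head.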

4. Experiments

• Dataset & Training Configuration

→ The paper defines the following pretraining and fine-tuning settings and sets out to answer these questions.

(1) How does LAPA perform when there are cross-task, cross-environment, and cross-embodiment gaps between pre-training and fine-tuning?

(Can it cope when there is a distribution mismatch between pre-training and fine-tuning?)

  • Cross-Task: standard ML setting

  • Cross-Env: data distribution mismatch

  • Cross-Emb/Multi-Emb: data distribution mismatch, or roughly matching distributions

(2) Can LAPA learn superior priors compared to using ground-truth actions during pretraining in a multi-embodiment setting?

  • LAPA does not pretrain on ground-truth actions. Can it still learn priors that are just as good?

(3) Can we create a performant LAPA solely from raw human manipulation videos?

  • Does LAPA also build good latent priors from human videos?

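• Baselines

  ◦ SCRATCH: a model that is only fine-tuned, with no pretraining.

  ◦ UNIPI: a diffusion model generates video rollouts, and at fine-tuning time an IDM extracts the actions (DoF values) from them.

  ◦ VPT: an IDM is trained on ground-truth action labels and then used to pseudo-label the actionless videos.

  ◦ ACTIONVLA: a VLA with the same backbone VLM, trained on action labels from the pre-training stage onward.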

LANGUAGE TABLE RESULTS

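→ In-Domain Performance: LAPA does not reach the upper bound, but the result confirms that its latent action generation is effective.

→ Cross-Task: fine-tuning uses a single task (7K), and the average over 5 tasks is reported. VPT scores higher thanks to its advantage in task fine-tuning, but LAPA also keeps a high score.

→ Cross-Env: evaluated in a real2sim setting, as in the accompanying figure. LAPA falls short of the upper bound, but the result is encouraging given that UNIPI and VPT fail to transfer.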

SIMPLER RESULTS

• Bridgev2 → 4 tasks (100 trajs)

→ With only 100 samples, UNIPI struggles to produce 7-DoF actions accurately enough for evaluation.

In short, when action-labeled data is insufficient, fine-tuning on action labels alone can only improve performance so far.

→ Compared with ACTIONVLA, which learns the action space by pre-training directly on 60K action-labeled trajectories, the results suggest that generalized training through a latent space may be possible.

REAL-WORLD RESULTS

• Pre-training uses Bridgev2 and Open-X, and the experiments involve moving a real robot arm.

→ Bridge:

  • LAPA outperforms OpenVLA even though it uses no action labels during pre-training.

  • One could argue that LAPA does well because Bridge contains many pick-and-place episodes.

    ◦ But benefiting from that would require using the action labels, and doing so in practice overfits to the WidowX action space.

    ◦ LAPA trains in a latent action space, so transfer learning is stable.

→ OpenX:

  • Performance is good overall thanks to the scaling law of the dataset.

→ Bimanual robot with a 14-DoF action space:

  • LAPA (30.21%) vs OpenVLA (26.04%), but still room for improvement.

LEARNING FROM HUMAN MANIPULATION VIDEOS

• Human pretraining (Something-Something V2) → fine-tuning (SIMPLER / robot manipulation)

→ Even though human videos have no action labels, leveraging human videos for latent action pretraining results in positive transfer.

The approach is robust to human-to-robot embodiment shifts.

→ This highlights the potential of raw human manipulation videos compared with expensive robot manipulation data that takes a long time to collect.

ABLATION AND ANALYSIS

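→ It is not clear to me why the experiments vary the sizes of the two dimensions that define the latent action space (presumably the sequence length $s$ and the codebook size $|C|$), but the aim was to find the best operating point by tuning both.

• Beyond a certain size, scaling them further has no effect (no scaling law).

→ Data and model size, in contrast, do follow a scaling law, and the approach is expected to work on whole-body videos and similar data in the future.

… The rest is a qualitative analysis of the VQ-VAE decoder, so it is omitted …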

5. Conclusion

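• In a setting where data is the bottleneck for robotics, the paper proposes a way to train a VLA from existing data through unsupervised learning, with no labeling.

• It cannot be called more effective than methods that use the target action labels directly in pre-training, but it outperforms a VLA pre-trained on all of the general action labels (i.e., OpenVLA, trained with 970K action-labeled trajectories).

• The paper designs and runs the range of experimental settings that a VLA needs to solve → these look like very good settings for actually solving this problem in the future:

  ◦ cross-task

  ◦ cross-emb

  ◦ multi-emb

• As the limitations section also notes, more validation is needed in environments with varied lighting or more external interference (e.g., cars, airplanes), but that seems to be a problem for a little later.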