Code Docs

Vision-Language-Action (VLA) - Pi0

Implementing Pi0 in Ark

Introduction

Vision-Language-Action (VLA) models in robotics combine visual perception, natural language understanding, and action planning into a single unified framework. These models allow robots to interpret instructions, understand their environment, and perform tasks in a more intuitive, human-like manner. In this tutorial, we focus on Pi0, a VLA model that integrates observations, textual prompts, and robot joint states to predict actions effectively for robotic tasks.

This tutorial walks you through the end-to-end workflow of training and simulating a Pi0 model for robotic manipulation in Ark.

Prerequisites

Make sure ark_framework, ark_types, and ark_ml are installed:

pip install -e git+https://github.com/Robotics-Ark/ark_framework.git#egg=ark_framework
pip install -e git+https://github.com/Robotics-Ark/ark_types.git#egg=ark_types
pip install -e git+https://github.com/Robotics-Ark/ark_ml.git#egg=ark_ml
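After installation, you can sanity-check that the packages import correctly. The import names below (ark for ark_framework, arkml for ark_ml) are inferred from the repository layouts used in this tutorial and may differ in your setup:

python -c "import ark, arkml"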

We assume that you have a properly configured global configuration file and an Input/Output schema defining the communication channels. Reference configuration files are provided in the Examples folder. Additionally, you need to set up the simulator with the appropriate objects and camera settings; please refer to the main documentation for detailed instructions.

ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml
ark_ml/arkml/examples/franka_pick_place/franka_config/default_io_schema.yaml

Rollout Pi0 Model

Start Registry

Ensure that the Ark registry service is running, as it is required for managing nodes and communication.

ark registry start

Start / reset the simulator

Launch the simulator or reset it to a clean state before starting a new session.

python sim_node.py

Start Text Repeater Node

Run the Text Repeater Node to send textual commands (prompts) to the VLA; the node publishes them to the appropriate channel.

python ark_framework/ark/tools/language_input/text_repeater.py --node-name test_input --config ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml

You can set the prompt text in the text section of global_config.yaml:

text: "Pick the yellow cube and place it in the white background area of the table"

Start Policy Node

Start the Policy Node, which will process observations and predict the robot’s actions.

CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.policy_service  algo=pizero   algo.model

Start Simulator Environment

Finally, launch the Simulator Environment to connect all components and begin running simulations.

python ark_ml/arkml/examples/franka_pick_place/franka_pick_place.py --max_step 200 --n_episodes 1 --step_sleep 0.1

There are a number of arguments you can configure in the simulator environment:

Argument            Use
--n_episodes        Total number of episodes to run.
--max_step          Maximum number of steps to execute per episode.
--policy_node_name  Name of the policy node that the environment will communicate with.
--config_path       Path to the global configuration file to load settings from.
--step_sleep        Time delay (in seconds) between each step.
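For example, to run five episodes against a policy node with an explicit configuration file (the node name below is a placeholder; use the name your policy node was started with):

python ark_ml/arkml/examples/franka_pick_place/franka_pick_place.py --n_episodes 5 --max_step 200 --policy_node_name pizero_policy --config_path ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml --step_sleep 0.1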

The Policy Node supports two modes of operation.

Service Mode

In this mode, the environment must explicitly request the policy node to predict the next action at every step.

To enable service mode, set policy_mode to service in the global_config.yaml file.

policy_mode: service

Stepper Mode

In this mode, the environment only needs to send a request to start the action sequence.

Once started, the policy node continuously receives observations, predicts the next actions, and publishes them to the corresponding channel.

This loop continues until the environment sends a request to stop or reset the service.

To enable stepper mode, set policy_mode to stepper in the global_config.yaml file.

policy_mode: stepper
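The difference between the two modes can be illustrated with a rough environment-side sketch. This is illustrative only: DummyEnv and DummyPolicyClient are placeholders, not the real Ark environment or policy-node interfaces.

# Illustrative sketch only: DummyEnv and DummyPolicyClient are placeholders,
# not the real Ark interfaces.
import time

class DummyEnv:
    def reset(self):
        return {"image": None, "joint_state": [0.0] * 7}          # placeholder observation
    def step(self, action):
        return {"image": None, "joint_state": [0.0] * 7}, False   # (observation, done)

class DummyPolicyClient:
    def request_action(self, obs):   # service mode: one request per step
        return [0.0] * 8             # placeholder action
    def start(self):                 # stepper mode: begin the prediction loop
        pass
    def stop(self):                  # stepper mode: stop / reset the loop
        pass

def run_service_mode(env, policy, max_step=200):
    # Service mode: the environment explicitly asks the policy node for every action.
    obs = env.reset()
    for _ in range(max_step):
        action = policy.request_action(obs)
        obs, done = env.step(action)
        if done:
            break

def run_stepper_mode(env, policy, max_step=200, step_sleep=0.1):
    # Stepper mode: a single start request; the policy node then consumes
    # observations and publishes actions on its own until stopped or reset.
    env.reset()
    policy.start()
    time.sleep(max_step * step_sleep)   # stand-in for waiting out the episode
    policy.stop()

run_service_mode(DummyEnv(), DummyPolicyClient(), max_step=3)
run_stepper_mode(DummyEnv(), DummyPolicyClient(), max_step=3)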

Training

This short guide explains how to prepare the data and configurations and fine-tune the Pi0 model.

Dataset

To fine-tune the Pi0 model, the dataset must follow a specific format. Each trajectory should be stored as a pickle file containing joint states, images, and the corresponding text prompt. At present, the model expects the joint state to be available at index 6 within the trajectory data.

Note: In the upcoming release, data collection and training will be automated to remove this overhead and simplify the fine-tuning process.

{"state":state, "action":action, "prompt":prompt}
state = [cube position, target potition,joint states, end effector, images]
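A minimal sketch of writing one trajectory in this layout is shown below, assuming each pickle file holds a list of per-step dictionaries. All values are placeholders; only the dictionary keys and the overall composition of state follow the format above, and the exact shapes and ordering must match what your environment records.

import pickle
import numpy as np

def save_trajectory(steps, path):
    # Write a list of per-step dictionaries to a single pickle file.
    with open(path, "wb") as f:
        pickle.dump(steps, f)

# Placeholder values; real data comes from the simulator or robot at each step.
cube_position = np.zeros(3)
target_position = np.zeros(3)
joint_states = np.zeros(7)
end_effector = np.zeros(7)
image_top = np.zeros((3, 480, 640), dtype=np.uint8)   # (c, h, w) image from the top camera

step = {
    "state": [cube_position, target_position, joint_states, end_effector, image_top],
    "action": np.zeros(8),   # placeholder action vector
    "prompt": "Pick the yellow cube and place it in the white background area of the table",
}

save_trajectory([step], "trajectory_000.pkl")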

Configurations

You can configure the following settings through the command line. To set model arguments, use the format algo.model.<argument>. For example, use algo.model.obs_dim to set the observation dimension.

model:
  obs_dim: 9 
  action_dim: 8         
  image_dim: (3, 480, 640)  # Image dimension (c,h,w)
  visual_input_features:
    - image_top

trainer:
  lr: 2e-4
  batch_size: 8
  max_epochs: 10
  num_workers: 2
  use_bf16: true
  weight_decay: 0.0

Fine-tune Model

Use your prepared dataset to fine-tune the model on task-specific data. This allows the model to adapt its predictions based on the joint states, images, and prompts provided, improving performance for your specific application.

CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.train algo=pizero data.dataset_path=/path/to/trajectories/ output_dir=/output/path
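
Model arguments can also be overridden on the same command using the algo.model.<argument> format described in the Configurations section (paths are placeholders):

CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.train algo=pizero algo.model.obs_dim=9 algo.model.action_dim=8 data.dataset_path=/path/to/trajectories/ output_dir=/output/path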