Vision-Language-Action (VLA) models in robotics combine visual perception, natural language understanding, and action planning into a single unified framework. These models allow robots to interpret instructions, understand their environment, and perform tasks in a more intuitive, human-like manner. In this tutorial, we focus on Pi0, a VLA model that integrates observations, textual prompts, and robot joint states to predict actions effectively for robotic tasks.
This tutorial walks you through the end-to-end workflow of training and simulating a Pi0 model for robotic manipulation in Ark.
Make sure ark_framework, ark_types, and ark_ml are installed:
pip install -e git+https://github.com/Robotics-Ark/ark_framework.git#egg=ark_framework
pip install -e git+https://github.com/Robotics-Ark/ark_types.git#egg=ark_types
pip install -e git+https://github.com/Robotics-Ark/ark_ml.git#egg=ark_ml
We assume that you have a properly configured global configuration file and an Input/Output schema defining the communication channels. Reference configuration files are provided in the Examples folder. Additionally, you need to set up the simulator with the appropriate objects and camera settings; please refer to the main documentation for detailed instructions.
ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml
ark_ml/arkml/examples/franka_pick_place/franka_config/default_io_schema.yaml
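For orientation, the fields from global_config.yaml that this tutorial touches later look roughly like the sketch below. This is an illustrative excerpt only (the key placement is an assumption); the reference file above remains the authoritative example with the full simulator, camera, and channel setup.

text: "Pick the yellow cube and place it in the white background area of the table"   # prompt published by the text repeater
policy_mode: service   # or "stepper"; see the mode descriptions further below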
Start Registry
Ensure that the Ark registry service is running, as it is required for managing nodes and communication.
ark registry start
Start / Reset the Simulator
Launch the simulator or reset it to a clean state before starting a new session.
python sim_node.py
Start Text Repeater Node
Run the Text Repeater Node to send textual commands (prompts) to the VLA; the prompts are published to the appropriate channel.
python ark_framework/ark/tools/language_input/text_repeater.py --node-name test_input --config ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml
You can set the prompt text in the text section of the global_config.yaml file:
text: "Pick the yellow cube and place it in the white background area of the table"
Start Policy Node
Start the Policy Node, which will process observations and predict the robot’s actions.
CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.policy_service algo=pizero algo.model
Start Simulator Environment
Finally, launch the Simulator Environment to connect all components and begin running simulations.
python ark_ml/arkml/examples/franka_pick_place/franka_pick_place.py --max_step 200 --n_episodes 1 --step_sleep 0.1
There are a number of arguments you can configure in the simulator environment:

Argument | Use |
---|---|
--n_episodes | Total number of episodes to run. |
--max_step | Maximum number of steps to execute per episode. |
--policy_node_name | Name of the policy node that the environment will communicate with. |
--config_path | Path to the global configuration file to load settings from. |
--step_sleep | Time delay (in seconds) between each step. |
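For example, a run that sets these options explicitly might look like the command below. The policy node name is a placeholder, and the config path points at the reference file from the example folder; adjust both to match your setup.

python ark_ml/arkml/examples/franka_pick_place/franka_pick_place.py --n_episodes 1 --max_step 200 --step_sleep 0.1 --policy_node_name <your_policy_node_name> --config_path ark_ml/arkml/examples/franka_pick_place/franka_config/global_config.yaml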
In service mode, the environment must explicitly request the policy node to predict the next action at every step.
To enable service mode, set policy_mode to service in the global_config.yaml file.
policy_mode: service
In stepper mode, the environment only needs to send a request to start the action sequence.
Once started, the policy node continuously receives observations, predicts the next actions, and publishes them to the corresponding channel.
This loop continues until the environment sends a request to stop or reset the service.
To enable stepper mode, set policy_mode to stepper in the global_config.yaml file.
policy_mode: stepper
This is a short guide that explains how to prepare the data and configuration and fine-tune the Pi0 model.
To fine-tune the Pi0 model, the dataset must follow a specific format. Each trajectory should be stored as a pickle file containing joint states, images, and the corresponding text prompt. At present, the model expects the joint state to be available at index 6 within the trajectory data.
Note: In the upcoming release, data collection and training will be automated to remove this overhead and simplify the fine-tuning process.
{"state":state, "action":action, "prompt":prompt}
state = [cube position, target position, joint states, end effector, images]
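As a concrete illustration, the snippet below writes one toy trajectory in this layout. The per-step dictionary keys follow the format shown above; the exact state composition, the placement of the camera image, and the idea that a trajectory is a list of per-step dictionaries stored in a single pickle file are assumptions based on the description here, so adapt them to your own data pipeline and verify that the joint states land at the index the model expects.

import pickle
import numpy as np

# Toy placeholder values -- replace with data recorded from your simulator or robot.
cube_position = np.zeros(3)
target_position = np.zeros(3)
joint_states = np.zeros(7)                                  # Franka arm joint positions
end_effector = np.zeros(3)
images = {"image_top": np.zeros((3, 480, 640), np.uint8)}   # (c, h, w), matches image_dim

# State layout as described above: cube position, target position, joint states,
# end effector, images. Check that the joint states end up at the index the model
# expects (index 6 in the current release).
state = [cube_position, target_position, joint_states, end_effector, images]

step = {
    "state": state,
    "action": np.zeros(8),  # matches action_dim in the training config
    "prompt": "Pick the yellow cube and place it in the white background area of the table",
}

# Assumption: a trajectory is a list of per-step dictionaries saved to one pickle file.
trajectory = [step]
with open("trajectory_0000.pkl", "wb") as f:
    pickle.dump(trajectory, f)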
You can configure the following settings through the command line. To set model arguments, use the format algo.model.<argument>. For example, use algo.model.obs_dim to set the observation dimension.
model:
  obs_dim: 9
  action_dim: 8
  image_dim: (3, 480, 640) # Image dimension (c, h, w)
  visual_input_features:
    - image_top
trainer:
  lr: 2e-4
  batch_size: 8
  max_epochs: 10
  num_workers: 2
  use_bf16: true
  weight_decay: 0.0
Use your prepared dataset to fine-tune the model on task-specific data. This allows the model to adapt its predictions based on the joint states, images, and prompts provided, improving performance for your specific application.
CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.train algo=pizero data.dataset_path=/path/to/trajectories/ output_dir=/output/path
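For example, to also override model settings from the command line using the algo.model.<argument> format described above, you could append overrides such as the observation and action dimensions to the same command (the values here mirror the sample configuration and are only illustrative):

CUDA_VISIBLE_DEVICES=0 python -m ark_ml.arkml.tools.train algo=pizero algo.model.obs_dim=9 algo.model.action_dim=8 data.dataset_path=/path/to/trajectories/ output_dir=/output/path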