Reinforcement Learning (RL)

PPO

We use the Proximal Policy Optimization (PPO) algorithm for training our humanoid robots.

Proximal Policy Optimization (PPO) is an on-policy reinforcement learning algorithm that maximizes expected reward using an actor-critic architecture. The actor network learns a policy that maps states to actions, while the critic network estimates the value of the states the policy visits. By using the critic's value estimates to guide policy updates, PPO can stably optimize the policy to maximize expected future rewards.

A common issue with RL, especially on-policy RL (where the learning algorithm can only use data gathered by the current policy to inform updates), is that algorithms tend to be myopic, aggressively optimizing for the transient data distribution induced by the current policy at the expense of stability. PPO improves upon previous policy-gradient methods by introducing a clipped surrogate objective, which limits how far the policy can deviate from the previous policy during an update. This clipping mechanism helps prevent catastrophically large policy changes while still allowing steady improvement toward optimal behavior.
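
Concretely, the clipped surrogate objective from the original PPO paper takes the form

    L_clip(θ) = E_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1 − ε, 1 + ε) · A_t ) ]

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio between the new and old policies and A_t is the advantage estimate.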

PPO is particularly suited for humanoid robotics control tasks. It efficiently handles continuous action spaces, provides stable learning through conservative policy updates, and can effectively maximize complex reward functions that encode desired robot behaviors. The actor-critic architecture allows it to learn sophisticated policies while managing the exploration-exploitation tradeoff through value estimation.

Our PPO implementation follows this high-level algorithm:

While not done:
    Collect experiences using current policy
    If update interval reached:
        Compute advantages and returns
        For each epoch:
            For each mini-batch:
                Compute policy ratio r = new_policy / old_policy
                Calculate L_clip = min(r * advantage, clip(r, 1-ε, 1+ε) * advantage)
                Compute value loss L_vf and entropy bonus
                Loss = -L_clip + c1 * L_vf - c2 * entropy
                Perform gradient update
        Clear experience buffer
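
As an illustration, the inner mini-batch update can be sketched in PyTorch roughly as follows; the `policy` and `value_fn` modules, tensor names, and default coefficients are assumptions for the sketch, not our production implementation:

    import torch

    def ppo_update(policy, value_fn, optimizer, batch, clip_eps=0.2, c1=0.5, c2=0.001):
        """One PPO mini-batch update (illustrative sketch)."""
        obs, actions, old_log_probs, advantages, returns = batch

        # Probability ratio r = new_policy / old_policy, computed in log space.
        dist = policy(obs)  # assumed to return a torch.distributions object
        log_probs = dist.log_prob(actions).sum(-1)  # diagonal Gaussian over joint actions
        ratio = torch.exp(log_probs - old_log_probs)

        # Clipped surrogate objective.
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        l_clip = torch.min(unclipped, clipped).mean()

        # Value-function loss and entropy bonus.
        l_vf = (value_fn(obs).squeeze(-1) - returns).pow(2).mean()
        entropy = dist.entropy().sum(-1).mean()

        # Total loss, as in the pseudocode above.
        loss = -l_clip + c1 * l_vf - c2 * entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()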

Currently, our hyperparameters are:

  • Learning rate: 1e-5 (adaptive)
  • Entropy coefficient: 0.001
  • Discount factor (gamma): 0.994
  • GAE lambda: 0.9
  • Number of epochs: 2
  • Number of mini-batches: 4
  • Actor network: [512, 256, 128] hidden units
  • Critic network: [768, 256, 128] hidden units
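
For reference, these values could be collected into a single configuration object along the lines of the sketch below; the field names are illustrative and do not reflect our actual configuration schema:

    from dataclasses import dataclass

    @dataclass
    class PPOConfig:
        # Values from the hyperparameter list above; names are assumptions.
        learning_rate: float = 1e-5               # adapted during training
        entropy_coef: float = 0.001
        gamma: float = 0.994                      # discount factor
        gae_lambda: float = 0.9
        num_epochs: int = 2
        num_mini_batches: int = 4
        actor_hidden_sizes: tuple = (512, 256, 128)
        critic_hidden_sizes: tuple = (768, 256, 128)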

Reward shaping is a very important part of training RL policies: it helps the agent learn faster and more stably, and is generally critical for learning complex tasks. In the following sections, we discuss the rewards used in our experiments.

Reward Shaping

General Configuration for Standing

A general configuration for standing involves ensuring that the default joint positions in the original URDF (Unified Robot Description Format) model correspond to the desired standing pose. The goal during training is to minimize deviation from this default pose.

If necessary, an orientation reward can be included to encourage the robot to maintain an upright posture. This can be achieved by adding a term to the reward function that penalizes deviations from the desired orientation.
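
A minimal sketch of these two standing terms, assuming the simulator exposes joint positions and a gravity vector projected into the base frame (the function name, weights, and exponential shaping are illustrative assumptions):

    import numpy as np

    def standing_reward(joint_pos, default_joint_pos, projected_gravity,
                        w_pose=1.0, w_upright=0.5):
        """Illustrative standing-reward terms; weights and shaping are assumptions."""
        # Penalize deviation from the default (standing) joint positions in the URDF.
        pose_error = np.sum(np.square(joint_pos - default_joint_pos))
        r_pose = np.exp(-pose_error)

        # Penalize tilt: for a perfectly upright base, gravity projected into the
        # base frame has zero x/y components.
        tilt = np.sum(np.square(projected_gravity[:2]))
        r_upright = np.exp(-tilt)

        return w_pose * r_pose + w_upright * r_upright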

Walking Rewards

For training the robot to walk, we have an additional set of rewards that are added to the standing rewards. Crucially, maintaining the original standing position accounts for 80% of the total reward during initial training, which ensures the policy first learns a stable standing position. This is essential since standing represents the base distribution from which other behaviors must develop.

  • Forward Velocity Reward: This reward encourages the robot to move forward. It can be defined as a function of the robot’s forward velocity, but is weighted to be less significant initially to prevent premature optimization for walking before stability is achieved.

Additional rewards such as feet clearance and contact forces are crucial for achieving sim2real transfer and handling various real-world properties like friction coefficients. These rewards ensure the policy learns realistic locomotion patterns that can translate to physical robots. The action smoothness reward particularly helps generate commands that are feasible for real-world actuators to execute under typical PID control schemes.
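
For illustration, the forward-velocity and action-smoothness terms could look roughly like the following; the tracking scale, weights, and names are assumptions rather than our exact configuration:

    import numpy as np

    def walking_reward(base_lin_vel, target_vel, action, prev_action,
                       w_vel=0.2, w_smooth=0.05):
        """Illustrative walking-reward terms added on top of the standing rewards."""
        # Forward-velocity tracking, weighted low early in training so the policy
        # stabilizes standing before it optimizes for walking.
        vel_error = (base_lin_vel[0] - target_vel) ** 2
        r_velocity = np.exp(-vel_error / 0.25)

        # Action smoothness: penalize large changes between consecutive actions so
        # the commands remain feasible for real actuators under PID control.
        r_smooth = -np.sum(np.square(action - prev_action))

        return w_vel * r_velocity + w_smooth * r_smooth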

The total walking reward can then be written as

    r_total = r_standing + Σ_i w_i · r_i

where w_i is the weight for each additional reward component r_i, also defined in the configuration scales.