Edge Vision-Language-Action (EVLA)

EVLA is an efficient vision-language-action model designed specifically for edge-device deployment in robotics applications. It builds on OpenVLA’s architecture while significantly reducing computational requirements, chiefly through a much smaller language backbone and single-pass control prediction. The model achieves inference speeds of 30-50 Hz on edge devices like the Jetson Nano while retaining the representational power of larger vision-language models’ encoders.

Architecture

EVLA’s architecture consists of three main components working in concert. The vision encoding is handled by pretrained SigLIP and DinoV2 models, which process and extract visual features from input images. The language processing is managed by Qwen2, a 0.5B parameter language model that represents a significant reduction in size compared to traditional approaches. These components are bridged by a projection layer that maps visual representations into the language model’s token space.
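As a rough illustration of how these pieces fit together, the sketch below wires stand-in modules for the two vision encoders, the projection layer, and the language backbone. The module names and dimensions (for example, a 896-dimensional token space, matching Qwen2-0.5B’s hidden size) are illustrative assumptions, not EVLA’s actual implementation.

```python
import torch
import torch.nn as nn

class EVLABackboneSketch(nn.Module):
    """Minimal sketch of an EVLA-style pipeline: two vision encoders,
    a projector into the LM token space, and a small language model.
    All modules and dimensions are placeholders, not EVLA's real code."""

    def __init__(self, siglip_dim=1152, dino_dim=1024, lm_dim=896):
        super().__init__()
        # Stand-ins for the pretrained SigLIP and DinoV2 encoders
        # (in practice these are loaded from pretrained checkpoints).
        self.siglip = nn.Linear(3 * 224 * 224, siglip_dim)
        self.dino = nn.Linear(3 * 224 * 224, dino_dim)
        # Projection layer mapping fused visual features into the LM embedding space.
        self.projector = nn.Linear(siglip_dim + dino_dim, lm_dim)
        # Stand-in for the 0.5B Qwen2 language backbone.
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image, text_embeds):
        # image: (B, 3, 224, 224); text_embeds: (B, T, lm_dim)
        flat = image.flatten(1)
        visual = torch.cat([self.siglip(flat), self.dino(flat)], dim=-1)  # fuse both encoders
        visual_tokens = self.projector(visual).unsqueeze(1)               # (B, 1, lm_dim)
        # Prepend the projected visual tokens to the language token sequence.
        return self.language_model(torch.cat([visual_tokens, text_embeds], dim=1))
```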

The model is trained on a comprehensive dataset of 1.2M text-image pairs, combining 558K samples from various captioning datasets with 665K synthetic multimodal instruction-tuning examples. This diverse training data supports robust performance across a wide range of robotics tasks.

Key Innovations

The primary innovation in EVLA lies in its approach to control prediction. Traditional vision-language models use autoregressive prediction, generating outputs token by token. EVLA instead employs joint control prediction, outputting end-effector positions in a single forward pass. This architectural change results in a 6x increase in inference speed while maintaining prediction accuracy, making it particularly well-suited for real-time robotics control applications.
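The sketch below shows what such a joint control head could look like: a small MLP, assumed here purely for illustration, that maps the language model’s final hidden state to a hypothetical 7-DoF end-effector command in one pass rather than decoding action tokens sequentially.

```python
import torch
import torch.nn as nn

class JointControlHead(nn.Module):
    """Illustrative joint control head: regress the full end-effector
    command (hypothetical 7-DoF: xyz, rotation, gripper) in a single
    forward pass instead of autoregressive token-by-token decoding."""

    def __init__(self, lm_dim=896, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(lm_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, action_dim),
        )

    def forward(self, hidden_states):
        # hidden_states: (B, T, lm_dim) from the language model.
        # Pool the sequence (last token here) and predict every action
        # dimension at once -- one head call per control step.
        return self.head(hidden_states[:, -1, :])

# One LM forward plus one head forward yields the complete action,
# versus N sequential decoding steps for N action tokens.
action = JointControlHead()(torch.randn(1, 32, 896))
print(action.shape)  # torch.Size([1, 7])
```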

EVLA also leverages recent advances in small language models (SLMs). By using Qwen2 with just 0.5B parameters, EVLA achieves performance comparable to larger models while significantly reducing computational requirements. This efficiency makes it possible to deploy the model on affordable edge devices, dramatically lowering the barrier to entry for robotics research and development.

Performance and Deployment

When benchmarked against OpenVLA on an A100-40GB GPU, EVLA demonstrates lower inference latency and a smaller memory footprint while maintaining similar training performance on both the Bridge and OXE datasets. This efficiency translates to practical benefits in deployment, allowing the model to run on affordable edge devices like the Jetson Nano instead of requiring expensive hardware like the $2,000+ Jetson AGX.
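For readers who want to reproduce this kind of comparison on their own hardware, a simple timing harness along the following lines (illustrative, not EVLA’s benchmarking code) reports average per-step latency and the implied control frequency in Hz.

```python
import time
import torch

def measure_hz(model, example_inputs, device="cuda", iters=100):
    """Rough latency probe for any callable model: average single-pass
    latency and the implied control frequency."""
    model = model.to(device).eval()
    example_inputs = [x.to(device) for x in example_inputs]
    with torch.inference_mode():
        for _ in range(10):                 # warm-up passes
            model(*example_inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(*example_inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        latency = (time.perf_counter() - start) / iters
    print(f"latency: {latency * 1e3:.1f} ms  ->  {1.0 / latency:.1f} Hz")
```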

The model’s performance is expected to improve further with optimization techniques such as FlashAttention-2 and FlexAttention. Additionally, EVLA’s relatively small size opens up possibilities for deployment on CPU architectures, further expanding its potential applications.
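As a sketch of what enabling FlashAttention-2 could look like, the snippet below loads the Qwen2-0.5B backbone through Hugging Face transformers with the `flash_attention_2` attention implementation. The checkpoint shown is the standalone Qwen2 model, not an EVLA release, and the flag requires the flash-attn package and a supported GPU; otherwise the default attention implementation is used.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the Qwen2-0.5B language backbone with FlashAttention-2 enabled
# via the transformers `attn_implementation` flag (requires flash-attn
# and an Ampere-or-newer GPU).
backbone = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```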

Future Development

EVLA represents a significant step forward in making vision-language models practical for robotics applications. The model’s ability to maintain high performance while running efficiently on edge devices democratizes access to advanced robotics control techniques. Future updates will include specific deployment instructions for Jetson Nano and AGX platforms, as well as optimizations for various edge computing scenarios.