Vision–Language–Action Models in Humanoid Robotics Practical Limits, Strategic Bets, and Market Implications

4 minute read

Published:

Originally published on Substack.

The humanoid robotics sector is entering a commercial-ready phase. Declining component costs, expanded manufacturing infrastructure, and incremental improvements in embodied cognition make targeted deployments plausible. One of the most debated decisions for product strategy is whether to build around Vision–Language–Action (VLA) models or adopt a hybrid architecture that combines high-level reasoning with robust low-level controllers and simulation-trained skills. In the recent speech, Unitree’s founder pointed out that the bigger problem is model architecture, not data quantity, and he also commented that the VLA is too limited. However, there is another startup Spirit AI is betting its roadmap on a VLA- first approach.

The strategic divergence is clear.

What “VLA” Means in Practice

VLA models attempt to unify three layers of robot cognition into a single model:

Vision: interpreting raw visual data from cameras.
Language: parsing human instructions or textual goals.
Action: generating motor control sequences directly from the perception– instruction input through the EtherCAT protocol.

The promise is appealing: one end-to-end model, a natural-language interface, and fewer bespoke engineering layers. But in physical robotics, the abstraction hides operational constraints that determine whether a product can scale.

Where VLA Performs Well — Spirit AI’s Rationale

VLA’s strongest commercial case is in constrained domains with repeatable layouts and predictable task types. Benchmarks and deployments show clear advantages in:

Natural UX: Users can issue instructions in plain language without learning system-specific commands.

Task decomposition: Strong performance in breaking down high-level instructions into a coherent sequence of subtasks.
Rapid iteration: With focused data collection, VLA can generalize well across similar environments (e.g., hotel lobbies, warehouse aisles). Simpler product stack: One core model reduces integration complexity.

Spirit AI’s Moz1 platform leverages these benefits, focusing on targeted verticals where environmental variation is limited, and the cost of retraining is acceptable.

Unitree’s Critique — The Structural Limits of VLA

The counterargument is rooted in engineering realities:

  1. Zero-shot generalization remains unreliable in messy, unfamiliar environments.

  2. Low-level control fidelity is brittle when generated directly from high- dimensional multimodal models.

  3. Skill accumulation via reinforcement learning has no proven scaling law; each new skill often requires retraining from scratch.

  4. Compute and latency constraints make high-capacity VLA challenging to deploy on-board without costly edge or cluster solutions.

Unitree’s position: the true market breakthrough will come when a robot can enter an unknown space, understand a request like “organize the living room,” and execute it reliably without preprogramming. My readers remembered my past article: Why I Walked Away from a $7 Million Robotics Deal, which mentioned a real unknown space for mining exploration, the current limitation was the key reason why I stopped the project. And yes, Current VLAs are not yet at that threshold.

Benchmark Evidence — The Current State of Play

While simulation benchmarks overstate readiness, several datasets and projects clarify the gap:

VLA models work in bounded domains today, but hybrid approaches offer more robustness when scaling beyond those domains. Years ago before the seed round of SpiritAI, I have talked to the founder and CEO of SpiritAI, Han Fengtao, he mentioned SpiritAI will focus on help elder in a nearly fixed space, which makes more sense for them to embrace VLA.

Alternative Path — Video-Driven World Models

Some companies are pursuing predictive video-based world models that simulate physics and visual dynamics to test candidate actions before execution.

Pros: Better modeling of environmental dynamics, potential for more efficient skill transfer.
Cons: Heavy GPU demand, simulation–reality gaps, slower productization.

This path aligns more closely with Unitree’s stance and may offer longer- term advantages for general-purpose robots, albeit with higher upfront investment.

Market Outlook & Strategic Recommendation

Short-Term Push: Use a VLA-first approach to land initial deployments and validate commercial value quickly. Limit scope (e.g., specific venues, controlled environments), and invest just enough to prove ROI and customer appetite.

Parallel Long-Term Investment: While pilots run, build hybrid infrastructure—world-model simulations, RL skill libraries, and modular controllers. Gradually shift higher-value or broad-market products onto this more scalable architecture.

Compute & Infrastructure Planning: For VLA-first, budget for edge inference or localized server racks. For hybrid, allocate cloud/GPU clusters for simulation and model training, but expect lower per-unit costs over time.

Combined, this hybrid strategy provides both a short-term runway and a long-term value corridor. Personal opinion, spirit AI choose a right way at the very beginning, let's see how their AI will evolve, while Unitree has the huge education market and will soon leverage the new Video-Driven World Models developed by Companies like Google.