Alibaba is building Qwen-Robot: the operating system for the robot economy


In brief

  • Alibaba has unveiled the Qwen-Robot Suite, a trio of AI models that handle robotic navigation, manipulation and physics-based world simulation within a single software stack.
  • The company says its models top several robotics benchmarks, having been trained on millions of samples and tens of thousands of hours of open source robot data.
  • Deploying robots in the real world is still years away.

Alibaba’s Qwen team released the Qwen-Robot Suite on Tuesday: three core models that make up what it calls the “full stack of embodied intelligence.” Qwen-RobotNav handles navigation. Qwen-RobotManip handles manipulation. Qwen-RobotWorld simulates the physics that makes both possible. Each works independently. Together they amount to an Android moment for robots: the operating system, not the hardware.

Alibaba is currently the only company in China working across chips, cloud, models, service platforms and applications. For the company, robots are the most physical expression of that bet: what is known as embodied artificial intelligence.

AI agents currently rely on LLMs to make their decisions. Robots, by contrast, typically run on conventional machine learning models, which, although advanced, lack the adaptability of generative AI. Physical agents also face a different, more stringent class of failure modes: physics, not prompts.

For these use cases, Alibaba introduced the new suite with three components:

Qwen-RobotNav unifies five navigation tasks (following instructions, moving between points, searching for objects, tracking a target, and autonomous driving), each of which requires a different visual memory strategy. Most models hard-code one strategy. Qwen-RobotNav instead exposes a parameterized interface: a token budget, a time decay, and per-camera weights that the planner can reconfigure in the middle of the loop.

It was trained on 15.6 million samples with randomization across all parameters, and achieved 76.5% success on VLN-CE RxR, a benchmark for vision-and-language navigation in continuous environments, and 90% tracking on EVT-Bench, which evaluates an agent’s ability to continuously follow moving targets.
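To make the idea of a parameterized observation interface concrete, here is a minimal sketch in Python. It is not Alibaba’s implementation: the article only names the three knobs, so the class `ObservationConfig`, the function `allocate_tokens`, and the rule that a frame’s share is proportional to its camera weight times the decay raised to its age are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationConfig:
    """Hypothetical knobs mirroring the three parameters named in the article."""
    token_budget: int = 1024      # total visual tokens the model may spend
    time_decay: float = 0.5       # weight multiplier per step back in time
    camera_weights: dict = field(default_factory=lambda: {"front": 1.0})


def allocate_tokens(cfg: ObservationConfig, history_len: int) -> dict:
    """Split the token budget across (camera, age) slots.

    A slot's share is proportional to camera_weight * time_decay ** age,
    where age 0 is the newest frame. Shares are rounded down, so the
    total never exceeds the budget.
    """
    raw = {
        (cam, age): weight * cfg.time_decay ** age
        for cam, weight in cfg.camera_weights.items()
        for age in range(history_len)
    }
    total = sum(raw.values())
    if total <= 0:
        return {slot: 0 for slot in raw}
    return {slot: int(cfg.token_budget * share / total) for slot, share in raw.items()}


# A planner favouring the front camera and recent frames.
cfg = ObservationConfig(
    token_budget=900,
    time_decay=0.5,
    camera_weights={"front": 2.0, "rear": 1.0},
)
plan = allocate_tokens(cfg, history_len=2)
# plan[("front", 0)] is 400, plan[("rear", 1)] is 100

# Mid-loop reconfiguration: stop discounting older frames, e.g. for a search task.
cfg.time_decay = 1.0
search_plan = allocate_tokens(cfg, history_len=2)
# search_plan[("front", 1)] is now 300 instead of 200
```

The point of the sketch is the design choice rather than the numbers: a single model can serve tracking (short memory, one dominant camera) and object search (long memory, all cameras) if the memory strategy is an input instead of a fixed part of the architecture.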

Qwen-RobotManip addresses one of the biggest challenges in working with robots: different robots represent actions in radically different ways. The Franka arm (a robot arm with seven degrees of freedom) acts through joint angles, while the ALOHA robot (a low-cost bimanual platform widely used in robotics research) represents actions through the position and orientation of its grippers (end-effector poses). Humanoids add another layer of complexity by using whole-body coordinates.

To bridge this mismatch in action representations, Alibaba has collected approximately 38,100 hours of training data from open source robot datasets and human videos, without relying on proprietary data collection. The model ranks first in RoboChallenge Table30-v1, outperforming previous methods by 20%.
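The difference between joint-space and end-effector actions can be shown with a toy example. The sketch below is a textbook forward-kinematics calculation for a planar arm, not the alignment method used in Qwen-RobotManip (the article does not describe it); the function name `forward_kinematics` and the two-link arm with unit-length links are assumptions chosen for illustration.

```python
import math


def forward_kinematics(joint_angles, link_lengths):
    """Map joint angles (radians) of a planar serial arm to its end-effector pose.

    Returns (x, y, heading): the tip position and orientation in the base frame.
    """
    x = y = heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle                 # each joint rotates every link after it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y, heading


# The same physical configuration written in two "action languages".
joint_space_action = [math.pi / 2, -math.pi / 2]    # joint angles, Franka-style
task_space_action = forward_kinematics(joint_space_action, [1.0, 1.0])
# task_space_action is approximately (1.0, 1.0, 0.0): an end-effector pose, ALOHA-style
```

A dataset recorded as joint angles and one recorded as end-effector poses describe motion in incompatible coordinates unless some mapping like this, which depends on each robot’s geometry, sits between them. That is why pooling data across robot bodies is hard.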

Qwen-RobotWorld is the most ambitious: a language-conditioned video model that treats natural language as a universal interface. “Pick up the red cup and pour water on the flower” works whether the actor is a gripper, a self-driving vehicle, or a mobile navigation agent.

Its corpus of embodied world knowledge spans 8.6 million video-text pairs (200 million frames) across manipulation (5.9 million samples, 1,300+ skills, 20+ morphologies), autonomous driving (Waymo, NVIDIA PhysicalAI-AD, Bench2Drive), indoor navigation (VLNVerse), and human-to-robot transfer across 14 robotic arms.

It ranks first in EWMBench and DreamGen Bench, two benchmarks for evaluating whether world models predict and generate realistic physical environments. It also outperforms all open source models on WorldModelBench and PBench, and scores perfectly in adhering to physics: Newton’s laws, conservation of mass, fluid dynamics, and gravity.

ChatGPT for robots?

While Western labs (Google DeepMind, Nvidia, Figure, and Physical Intelligence) pursue similar goals, most focus on navigation or manipulation, not on a unified, composable package. Alibaba’s vertical integration from chips to applications means it controls the entire stack. Its open source approach sets it apart from competitors who rely on proprietary robot data.

A few misconceptions are worth clearing up: these are not robots but software models, that is, brains, not bodies. They run on hardware from AgileX, Franka, Universal Robots, Unitree, and others.

Also, although these are generative AI models for robots, they are not LLMs like ChatGPT. A language model predicts tokens. These models must understand physics, spatial relationships, and the consequences of physical actions. A language model tells you that a glass breaks if it falls. Qwen-RobotWorld predicts how it will break: the shatter pattern, fluid dynamics, and secondary collisions. Qwen-RobotManip plans a grip that prevents the fall altogether.

Don’t expect a household robot of your own any time soon. The gap between a controlled demonstration of a robot putting fruit in a basket and a robot working reliably in your home is enormous. RoboCasa365, LIBERO-Plus and RoboTwin-Clean2Rand are simulation benchmarks. Real-world deployment introduces sensor noise, actuator drift, and a long tail of edge cases that have humbled every robotics effort in history, and Alibaba knows this.

But the technical achievements are real. RobotManip’s alignment-first approach solves a real bottleneck in cross-embodiment training. RobotNav’s parameterized observation interface is a clever solution to the context strategy problem. RobotWorld’s language interface is the right abstraction for modeling the world across domains.

Alibaba has not revealed pricing, timelines, or which customers will have access beyond the pilot programs.
