Exclusive Roundtable | Future Directions and Technical Pathways for Robot Motion and Perception
livelybot | 2025.09.25

Speaker Introductions

 
Su Zhi: A senior student in the Yao Class at Tsinghua University. He led the development of the HITTER hierarchical framework during his visit to UC Berkeley and has published two papers as co-first author at top international robotics conferences.
Hou Taixian: PhD candidate at Fudan University's School of Intelligent Robotics and Advanced Manufacturing Innovation. His research focuses on perception and learning-based motion control for legged robots, including safe and robust locomotion, gait rhythm control, and extreme parkour. Key contributions include FTML, MusicWalker, and Re-net.
Zhuang Ziwen: PhD candidate at the Institute for Interdisciplinary Information Sciences, Tsinghua University. His research focuses on robotic motion intelligence and learning algorithms for legged robots. His published work "Humanoid Parkour Learning" addresses the long-standing challenge of slow locomotion in humanoid robots and improves generalization in highly dynamic tasks.
Ding Gang: Holds a Ph.D. in Computer Science from Peking University, where he studied under renowned scholar Professor Huang Tiejun. He formerly served as a researcher at the Beijing Academy of Artificial Intelligence (BAAI) and currently leads the Humanoid Robotics division at BeingBeyond.
Zhang Xiaobai: Founder & CEO of HighTorque Robotics.

Key Discussion Points

 
Origin of the Roundtable: Why Focus on "Humanoid Robots' Ultimate Capabilities and Implementation Pathways"?
 

Core Issue 1: When Will Humanoid Robots Surpass Humans?

1. Specialized movements already exceed human capabilities, but gaps remain in perception and dexterous hand manipulation, requiring coordinated progress in both hardware and software. (Ding Gang)

2. Whole-body hardware manufacturing already meets requirements, yet balancing accessibility and cost remains critical. Breakthroughs in dexterous hands must address agility, battery life, and durability. (Zhuang Ziwen, Ding Gang)

3. Clear application scenarios enable faster adoption, whereas general control algorithms require longer development cycles. Generalization capability depends heavily on data acquisition and scaling. (Zhuang Ziwen, Su Zhi)

4. Initial implementation is possible within 2–3 years after hardware breakthroughs, with conservative estimates pointing to tangible progress within 5 years. Widespread generalization may take 5–10 years. (Ding Gang, Su Zhi, Hou Taixian)

   

Core Issue 2: Will Future Robot Motion Perception Methods Converge with or Diverge from Human Approaches?

1. Human perception carries evolutionary "legacy traits", while robots employ diverse sensors; humans achieve few-shot learning through genetic endowment, whereas robots require massive simulation data. (Zhuang Ziwen)

2. Reinforcement learning shares underlying logic with human learning; robot sensors may evolve toward "human-like dual RGB cameras"; morphological alignment with humans can enhance environmental and data compatibility. (Hou Taixian, Ding Gang)

3. Robots may follow a "pre-training + efficient RL" path, breaking away from "fixed pre-trained models" to achieve true "acquired learning." (Su Zhi, Zhang Xiaobai)

   

Core Issue 3: Research and Practice in Online Learning Frameworks for Robot Motion Algorithms

1. High iteration costs limit validation to simple scenarios; balance-critical tasks are "unaffordable to fail" due to hardware fragility and stringent sample-efficiency requirements. (Zhuang Ziwen, Ding Gang)

2. Early RL research went "real robot first, simulation later"; parallelized engines and accessible tools later reversed that order. The core methods still apply, but now lean on "foundation model + real-device fine-tuning." (Hou Taixian, Ding Gang)

3. Foundational control models are needed to reduce costs; online reinforcement learning requires foundation models to improve efficiency, though technical routes have not yet converged. (Zhuang Ziwen, Su Zhi)

4. Mass production may revive "real-device online learning" to accommodate personalized needs. (Hou Taixian)

   

Core Issue 4: Research and Exploration in Humanoid Robot Perception Capabilities

1. Perception-decision-control requires a layered design: complex scenarios suffer from insufficient decision-making, while simple scenarios allow perception to supply control parameters directly. (Zhuang Ziwen)

2. Simple scenarios rely on system-level collaboration, whereas complex scenarios demand optimized perception performance. (Zhuang Ziwen)

3. SLAM models are large and computationally intensive, making edge deployment challenging; traditional SLAM performs poorly in dynamic environments. (Hou Taixian, Zhang Xiaobai)

4. "VLM + control" transfers large-model scene understanding to underlying motion skills; borrows the biological logic of "capturing local key information"; uses hierarchical networks for navigation decisions. (Ding Gang, Zhuang Ziwen, Zhang Xiaobai)

   

Core Issue 5: Can We Develop Universal Motion Algorithms for Similar Robot Morphologies That Are Directly Deployable Across Platforms?

1."The 'Cerebrum' Can Be Shared, the 'Cerebellum' Cannot": A universal high-level planner ("cerebrum") combined with platform-specific low-level controllers ("cerebellum") can achieve indirect cross-platform deployment. (Ding Gang)

2.Training Frameworks Can Cross Humanoid Platforms; Quadrupeds and Bipeds are Difficult: A shared training framework is feasible for humanoid robots, but cross-platform application between quadrupeds and bipeds is challenging, with the prerequisite of "similar body proportions." (Zhuang Ziwen, Zhang Xiaobai)

3."Pre-planned Trajectories + Inverse Kinematics" for Legged Robots Limits Freedom; "Teacher-Student Distillation" Enables Cross-Morphology with Fine-tuning, Possibly at a Performance Cost: Pre-defined trajectories adapted via IK work for robots like robotic dogs but limit motion freedom. Knowledge distillation from a "teacher" model enables adaptation across morphologies, though it requires fine-tuning and might sacrifice some performance. (Hou Taixian, Su Zhi)

4.Morphology Extension is a Form of Cross-Platform; "Shared Lower Layers + Branched Upper Layers" Architecture Adapts to Different Forms: Expanding a robot's own capabilities (e.g., adding an arm) is a cross-platform problem. An architecture with shared lower-level layers and branched upper-level controllers can suit different morphologies. (Su Zhi, Hou Taixian)

 
   
"We believe humanoid robots will eventually reach and even surpass human capabilities, ultimately entering millions of households to serve humanity." This statement by roundtable initiator Zhang Xiaobai encapsulates the core vision driving this discussion.
 
During his visits to frontline research institutions, Zhang Xiaobai observed significant divergence in academic perspectives on core issues such as "how robots can break through key capabilities" and "when they might surpass humans." Some experts expressed concerns about the pace of hardware iteration, others emphasized bottlenecks in algorithms, some focused on the perception-motion closed loop, while yet others prioritized the integration of data and large models. Notably, previous academic conferences often centered on specific technical details, with few attempts to bridge "macro trends" and "in-depth technical challenges" in a unified dialogue.
 
Thus, this roundtable bringing together four frontline researchers came into being. On the eve of the discussion, newly surfaced research on platforms like Twitter — spanning general motion tracking and generalized perception algorithms — further confirmed that progress in humanoid robotics is advancing "faster than expected." Zhang Xiaobai opened the session with the first core question: When will humanoid robots surpass human capabilities?
 
 
All four speakers agreed that "surpassing humans" cannot be judged in the abstract and must be analyzed by specific capability dimensions. Their projections regarding the timeline were structured around three core variables: hardware iteration, algorithmic breakthroughs, and data accumulation.
 
Ding Gang: Hardware as the Foundation, Software and Hardware Need "Two-Way Collaboration"
 
Let me start with a preliminary view. On this issue, we can consider it from different dimensions. If we focus solely on acrobatic movements, current robots can easily perform actions like backflips that are difficult for humans, demonstrating their performance advantages. However, when it comes to cognitive perception capabilities and dexterous manipulation with agile hands, there is still a significant gap in the current technological level.
 
Taking dexterous hands as an example, achieving the flexible operation of "grasping whatever one desires" is currently constrained by two aspects: firstly, the performance of the hardware itself, and secondly, the supporting capability of the brain's VLA (Vision-Language-Action). This indicates that there is still a long way to go in this field, requiring a "two-way collaboration" between software and hardware.
 
On the hardware front, the development of dexterous hands needs to overcome key bottlenecks: Can they achieve the agility of the human hand? Can they ensure long battery life? Can they balance durability (for example, performing actions like cracking walnuts without damage)? These are core issues that hardware manufacturers must address.
 
On the software side, only after hardware performance meets standards can richer practical data be obtained, thereby supporting VLA technology in achieving ideal results. Therefore, the development of software and hardware is an interdependent and mutually reinforcing process.
 
As for the specific timeline, since my expertise is primarily in the algorithm domain, I lack sufficient insight into the iteration pace and technical milestones of hardware development, making it difficult to provide a precise evaluation at this time. However, it is reasonable to speculate that if breakthrough progress in hardware can be achieved, initial implementations of related technologies could emerge within 2-3 years, with a conservative estimate suggesting tangible advancements within 5 years.
 
Zhuang Ziwen: Scenario Determines Speed — "Well-Defined Tasks" Will Land Sooner Than "Generalization Capability"
 
Indeed, as previously mentioned, the key lies in deconstructing the concept of "human-level performance." After all, many specialized movements are challenging even for untrained humans. Evaluating whether robots have reached "human-level" must be considered through specific dimensions.
 
First, considering the manufacturing and mechanical performance at the hardware level, current technology already fully meets the requirements. The real challenge lies not in technical implementation but in hardware manufacturers' need to balance "accessibility" and "cost": they must ensure global developers can both access and afford the technology, which inevitably involves trade-offs.
 
Regarding dexterous hand technology, the field itself remains relatively nascent and requires adaptation to diverse software systems and application scenarios. At this stage, specific application directions tied to dexterous hands are not yet well-defined, necessitating further exploration and real-world validation.
 
Finally, considering the software-oriented algorithm layer: if we focus on general-purpose control algorithms for humanoid robots, these will eventually need to integrate core capabilities such as perception and intelligence. I believe the path to maturity for such technologies will be relatively long.
 
However, for well-defined specific scenarios—such as arranging tables and chairs in a conference venue or performing basic cleaning tasks in a hotel—where requirements are clear and boundaries are distinct, the deployment of robots is likely to progress at a significantly faster pace.
 
Hou Taixian: Fundamental Science Has Verified the Ceiling, Hardware and Perception Are Key Gaps
 
As the previous two speakers noted, the application scenario is a key factor. This principle extends beyond robotics—even in foundational fields like mathematics, we see numerous cases of AI surpassing human capabilities. This indicates that the upper limit of robotic potential is clearly within sight.
 
However, the time required to reach this ceiling depends heavily on the specific task. For example, in scenarios such as parkour performances by robots, or previously mentioned tasks like arranging tables at home or serving as domestic helpers, there are still significant gaps in two core areas: first, the underlying hardware, particularly the performance of dexterous hands; and second, the perceptual capability to understand complex environments.
 
I believe that only when these two gaps are effectively bridged can robots potentially surpass humans in such tasks. However, in terms of timeline expectations, I tend to be more cautious and do not see this as a goal achievable within just two or three years.
 
Su Zhi: Generalization Applications of Humanoid Robots Are Nearing the Goal, with Data Acquisition and Scaling as Core Bottlenecks
 
I agree with the core views expressed by the previous three speakers. I also believe that enabling humanoid robots to achieve highly generalized capabilities—such as changing light bulbs, making beds at home, playing ball with people on weekends, or performing various tasks in factories—is already relatively close to realization.
 
It is particularly worth emphasizing that technological experiences from other fields can provide us with important references, such as the Scaling Law in the NLP domain. At this stage, the core challenges lie more in the "data" aspect: on one hand, how to obtain sufficient and high-quality training data, and on the other hand, how to scale up this data to support the enhancement of generalization capabilities.
 
Regarding the specific timeline, my personal judgment is that it will take approximately 5 to 10 years for humanoid robots to truly possess the generalization capabilities described. This timeframe is not particularly short, but neither is it excessively long.
 
Zhang Xiaobai (Moderator): A ten-year timeline is not long, as technological development driven by large models has exceeded expectations.
 
Regarding this issue, I believe that while there may be differences in perspectives, the overall consensus is clear—a ten-year timeline is not considered long. My view aligns with everyone else's: I remain consistently optimistic about the development of humanoid robots. The progress in the field of large models in recent years has particularly exceeded expectations. Of course, I am aware that large models do not equate to general intelligence, but it is undeniable that the current pace of technological iteration, the intensity of industry resource investment, and the growth rate of research talent—coupled with the efficiency of research advancement—have all far surpassed our previous imaginations. This lays a strong foundation for breakthroughs in robotics.
 
 
Zhuang Ziwen: The Core of "Difference" — Fundamental Logic Divergence Between Biology and Technology
 
I would like to approach this from the perspective of the "fundamental logical differences between humans and robots." As products of biological evolution, humans carry many "legacy traits." For example, the human eye, influenced by evolutionary pathways, does not have as wide a field of vision as that of prey animals. In contrast, the perceptual systems of robots are entirely different: with current technology, they can be equipped with a wide variety of sensors, each with distinct functions.

This difference is directly reflected in two key aspects. On the one hand, whether it's deploying technology onto physical robots or collecting training data in the early stages, the types of data sources required for robots and the variety of matching sensors will be highly diverse. On the other hand, the logic by which humans and robots acquire knowledge and master skills is fundamentally different: mainstream reinforcement learning, for example, often requires robots to undergo extensive, long-term training in simulators, yet even a human infant learning to walk does not need such vast amounts of data to master the skill.

Hou Taixian: Reinforcement Learning Shares Similar Logic with Human Learning; Robot Sensors Can Be Optimized Toward "Human-like" Design

Regarding the same technology, different people may have different perspectives. For example, when it comes to reinforcement learning, I believe its logic is actually quite similar to human learning — we set up thousands of parallel learning environments in Isaac Lab for robots to train in, which mirrors the process of infants acquiring basic skills through trial and error.
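To make the parallel trial-and-error picture concrete, here is a minimal sketch of the kind of batched training loop such setups use. The BatchedLeggedEnv class and its tensor-based step() interface are hypothetical placeholders rather than the actual Isaac Lab API; the point is simply that one policy forward pass drives thousands of simulated robots at once.

```python
# Minimal sketch of massively parallel trial-and-error training (hypothetical
# interfaces; not the actual Isaac Lab API). Thousands of environments step in
# lockstep, so each gradient update sees a large batch of experience.
import torch
import torch.nn as nn

NUM_ENVS, OBS_DIM, ACT_DIM = 4096, 48, 12  # e.g. proprioception -> joint targets

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, ACT_DIM),
)

class BatchedLeggedEnv:
    """Stand-in for a GPU-parallel simulator: every tensor is batched [NUM_ENVS, ...]."""
    def reset(self) -> torch.Tensor:
        return torch.zeros(NUM_ENVS, OBS_DIM)

    def step(self, actions: torch.Tensor):
        # A real simulator would integrate physics here; we return dummy tensors.
        obs = torch.zeros(NUM_ENVS, OBS_DIM)
        reward = torch.zeros(NUM_ENVS)
        done = torch.zeros(NUM_ENVS, dtype=torch.bool)
        return obs, reward, done

def collect_rollout(env: BatchedLeggedEnv, horizon: int = 24):
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        with torch.no_grad():
            actions = policy(obs)          # one forward pass controls every env
        next_obs, reward, done = env.step(actions)
        transitions.append((obs, actions, reward, done))
        obs = next_obs
    return transitions                      # fed to PPO or a similar on-policy update

rollout = collect_rollout(BatchedLeggedEnv())
```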
 
As Dr. Zhuang mentioned earlier, differences in sensors indeed exist: humans perceive the world through their eyes (equivalent to two RGB perspectives), while robots often rely on devices like RealSense. However, this is not a fixed pattern—robot sensor selection is inherently diverse. We can equip them with human-like dual RGB camera solutions, or opt for non-human-like configurations.
 
However, from a broader trend perspective, I believe the industry is leaning toward developing robots in a "human-like" direction. Even though auxiliary perception devices like LiDAR are still needed today, in the long run the most elegant solution may well converge on human-like sensing.
 
Su Zhi: Infants rely on genetic encoding and few-shot learning, while robots may follow a "pre-training + efficient RL" path.
 
As Ziwen just mentioned regarding the sample efficiency challenges of robot reinforcement learning in simulated environments, it's important to emphasize that infants' ability to learn with minimal data stems from the vast amount of genetically encoded information they possess. This pre-loaded information is gradually decoded during development, combined with limited real-world few-shot reinforcement learning, enabling rapid skill acquisition. This biological mechanism represents the core logic of infant learning.
 
Therefore, I believe the future development of robotics may follow a similar path: first, accumulating foundational capabilities through extensive pre-training tasks. Then, when facing novel tasks in real-world deployment, high-quality reinforcement learning algorithms—including both efficient pre-training methods and real-device reinforcement learning—will enable generalization to new scenarios with minimal data.
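As a rough illustration of the "pre-training + efficient RL" recipe described above, the sketch below loads a policy from a hypothetical multi-task pre-training checkpoint, freezes its backbone, and adapts only a small head with a simple policy-gradient update. All names, files, and dimensions are illustrative assumptions, not a specific published method.

```python
# Schematic of "broad pre-training + data-efficient RL on a new task".
# The checkpoint name and network sizes are placeholders for illustration.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 256), nn.ELU(),
                                      nn.Linear(256, 256), nn.ELU())
        self.head = nn.Linear(256, act_dim)

    def forward(self, obs):
        return self.head(self.backbone(obs))

policy = PolicyNet(obs_dim=48, act_dim=12)
# 1) Start from weights learned on many pre-training tasks (hypothetical file).
policy.load_state_dict(torch.load("pretrained_multitask_policy.pt"))

# 2) Freeze the backbone so only a small head adapts to the new task,
#    one way to keep the real-world sample requirement low.
for p in policy.backbone.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.head.parameters(), lr=3e-4)

def finetune_step(obs, actions, advantages, sigma=0.1):
    """One REINFORCE-style update on a small batch of new-task experience."""
    mean = policy(obs)
    log_prob = -((actions - mean) ** 2).sum(-1) / (2 * sigma ** 2)
    loss = -(log_prob * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```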
 
Ding Gang: The Value of Humanoid Robots Lies in Adapting to Human Environments and Data, with Internet Data Laying the Foundation for Their "Brain"
 
When I first entered the industry, I represented BAAI in procuring robots from Fourier. At the time, I asked Fourier's hardware lead: "What is the core value of humanoid robots?" He offered two key answers:
 
First, most facilities in today's society — such as tables and chairs — are designed for humans. Humanoid robots, by sharing our form, can seamlessly integrate into existing environments and equipment. Second, on the data side, the internet contains vast datasets centered on human activities. This data serves as the foundation for building the "brain" of robots. If we want humanoid robots to possess intelligence, we must rely on large-scale models — whether language, vision, or multimodal — and the development of such technologies depends heavily on abundant data support.
 
Therefore, since the application scenarios and data sources of robots are highly correlated with humans, aligning their form and other key dimensions with human characteristics is the more rational direction. This is also the core focus of our company, BeingBeyond: we are leveraging massive internet data for training, using this data to build the "brain" of humanoid robots and empower them to achieve more generalized task execution.
 
Zhang Xiaobai (Moderator): "Pre-installed Models + Acquired Learning" — Robots Must Break Through "Acquired Learning" Capabilities After Model Pre-installation
 
Thank you all for your sharing. Everyone has just offered highly valuable, and at times profound, insights. For instance, one perspective drew an analogy to human DNA, suggesting that children are born with innate information and later acquire skills through learning — I find this logic entirely valid.
 
In the field of robotics, the future can indeed follow a similar path — we are already pre-installing models in robots to establish their foundational capabilities. However, a key issue remains: currently, robots are limited to "what is pre-installed is what they can do," lacking the capacity for subsequent autonomous expansion.
 
The advantage of humans lies in our capacity for acquired learning: from eating and writing to various motor skills, we gradually master them through growth. Enabling robots to not only rely on pre-installed models but also possess the ability to learn new skills post-deployment, much like humans, may be the next core direction requiring breakthrough.
 
 
Since robots require "post-deployment optimization," Zhang Xiaobai further asked: Is it necessary to develop "online learning motion algorithms" (where robots learn new actions in real-time within actual environments)? What are the current challenges and pathways?
 
Zhuang Ziwen: We have not yet initiated research into online motion learning algorithms, primarily due to the prohibitively high costs associated with physical robot validation and iteration.
 
I have not yet initiated research on online motion learning algorithms for robots, primarily due to the core requirement of scientific validation—rapid iteration. Although institutions like Berkeley have produced notable achievements in physical robot reinforcement learning in recent years, such work faces clear limitations: the tasks validated are predominantly simple scenarios, such as basic walking, with extensions covering only slightly more complex environments like grassy terrain or 40-degree slopes. It remains difficult to address validation in truly complex scenarios—for instance, conducting real-time reinforcement learning in dynamic environments like forests while integrating visual perception.

The fundamental obstacle lies in the prohibitively high iteration costs of physical validation, which hinder the rapid completion of the "idea-validation-optimization" loop. Neural networks require training from scratch, and while funding can solve predictable challenges, the time cost remains the critical bottleneck—the process of continuous training to reach basic competency thresholds is difficult to accelerate significantly.
 
However, there is a consensus within the industry that if humanoid robots can first establish foundational control models — such as tracking models — the situation would significantly improve. With this foundational capability in place, subsequent Real-World Reinforcement Learning would not need to start training from scratch on low-level skills, thereby substantially reducing overall iteration costs.
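The idea of a foundational tracking model can be pictured as a frozen low-level controller plus a small command policy that real-world RL actually has to learn. Everything below (class names, checkpoint file, dimensions) is a hypothetical illustration of that split, not an existing system.

```python
# Sketch of "foundation tracking controller + lightweight real-world RL":
# the low-level tracker is frozen, and only a small command policy is learned
# online. All class and file names here are hypothetical.
import torch
import torch.nn as nn

class FrozenTracker(nn.Module):
    """Pre-trained whole-body tracking model: (proprioception, command) -> joint targets."""
    def __init__(self, obs_dim=48, cmd_dim=3, act_dim=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + cmd_dim, 512), nn.ELU(),
                                 nn.Linear(512, act_dim))

    def forward(self, obs, cmd):
        return self.net(torch.cat([obs, cmd], dim=-1))

class CommandPolicy(nn.Module):
    """Small head trained online; it only proposes velocity/heading commands."""
    def __init__(self, obs_dim=48, cmd_dim=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(),
                                 nn.Linear(128, cmd_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

tracker = FrozenTracker()
tracker.load_state_dict(torch.load("tracking_foundation_model.pt"))  # hypothetical checkpoint
tracker.requires_grad_(False)

command_policy = CommandPolicy()  # the only part real-world RL has to optimize

def act(obs: torch.Tensor) -> torch.Tensor:
    cmd = command_policy(obs)          # high-level decision, cheap to learn online
    return tracker(obs, cmd)           # low-level skill comes from the foundation model
```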
 
Hou Taixian: The research path for RL robots was previously "reversed," but may return to this approach due to mass production needs.
 
Early RL research followed an "inverted path" of hardware-first, simulation-later due to immature physical simulation technology, contrary to the later common perception of "simulation-first, hardware-later." This approach was shaped by the technical constraints of the time rather than methodological preference.
ETH Zurich has been a pivotal hub for reinforcement learning robotics research, known for its breakthroughs in parallelized physics engine technology (exemplified by Raisim). Coupled with the increasing accessibility of tools like Isaac Lab and Isaac Gym to developers, simulation technologies have rapidly gained traction — leading subsequent researchers to perceive simulation as the initial catalyst in the field's evolution.
In my view, the core methodology of reinforcement learning (RL) remains unchanged and continues to hold practical value, with ongoing research papers continuously optimizing and adjusting it. The research community's prevailing pursuit of "zero-shot" capabilities without fine-tuning reflects more of a research-taste preference than an insurmountable technical challenge.
 
If humanoid robots achieve mass production and enter households in the future, research may revisit the path of "real-device online learning" to address interactive performance and personalized user needs. This approach remains viable — its current absence in research reflects a phase-specific choice rather than a fundamental technical exclusion.
 
Ding Gang: Foundation models can reduce costs, but balance-critical tasks cannot afford failures, requiring robust foundational models as support.
 
I fully agree with Dr. Hou's perspective. The rise of simulation technology in robotics stems from researchers recognizing the limitations of relying solely on physical hardware for advancement, which led to the exploration of reinforcement learning in simulated environments. However, this does not imply that physical robots have lost their significance.
 
Taking our ongoing research on dexterous hands as an example, we still rely on the "pre-training + real-device fine-tuning" approach in practice. First, a powerful foundation model named "BeingH0" is trained on large-scale internet data (a recent work by BeingBeyond — interested peers are welcome to follow this achievement). With this foundational model as a base, the dexterous hand already possesses solid basic capabilities. Subsequently, only a small amount of real-device data is needed for fine-tuning, often completing the task within an hour or with minimal raw data.
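For readers less familiar with this recipe, the sketch below shows what "pre-training + real-device fine-tuning" can look like in code: a pre-trained model is adapted on a small tele-operated dataset with a low learning rate. The model class, file names, and behavior-cloning objective are illustrative assumptions, not the actual BeingH0 interface.

```python
# Minimal sketch of fine-tuning a pre-trained manipulation model on a small
# batch of real-device demonstrations. The model class and data are
# hypothetical placeholders, not the actual BeingH0 interface.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PretrainedHandPolicy(nn.Module):
    """Stand-in for a vision-to-action foundation model."""
    def __init__(self, obs_dim=512, act_dim=22):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 1024), nn.GELU(),
                                     nn.Linear(1024, 1024), nn.GELU())
        self.action_head = nn.Linear(1024, act_dim)

    def forward(self, obs):
        return self.action_head(self.encoder(obs))

model = PretrainedHandPolicy()
model.load_state_dict(torch.load("hand_foundation_model.pt"))  # from large-scale pre-training

# A small tele-operated dataset collected on the real hand (placeholder tensors).
demos = TensorDataset(torch.randn(2000, 512), torch.randn(2000, 22))
loader = DataLoader(demos, batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: adapt, don't overwrite
for epoch in range(5):                                      # minutes of compute, not days
    for obs, action in loader:
        loss = nn.functional.mse_loss(model(obs), action)   # behavior-cloning objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```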
 
However, the limitations of physical robot research become more pronounced in tasks requiring balance maintenance — to the extent that we "cannot afford falls." For instance, if a robot needs to learn high-difficulty balancing motions like cartwheels, any misstep during physical training could lead to hardware damage. Coupled with reinforcement learning's demanding sample efficiency requirements — often needing thousands or even millions of raw data samples — such balance-critical tasks become nearly impossible to achieve through pure physical robot training.
 
Therefore, I believe that for balance-critical tasks, it is absolutely infeasible to develop from scratch. We must first build a powerful foundation model with fundamental capabilities, and then further enhance its performance through fine-tuning. This represents a more viable technical pathway.
 
Su Zhi: The work on Online RL by Luo Jianlan's team reveals limitations; it requires Foundation Models to improve efficiency, and the academic pathway has not yet converged.
 
In the field of Online RL, I believe one of the more representative recent achievements is the two studies conducted by Professor Luo Jianlan's team. However, their work focuses on relatively simple scenarios, not involving humanoid robots, and primarily centers on manipulation tasks. Technically, it integrates Offline RL with Online RL. Even so, the Online RL component still exhibits clear limitations — even for basic actions like "pick-and-place," it requires tens of minutes of training, while humans can master similar actions in just one or two minutes.
 
To address this efficiency gap, I believe the key lies in introducing foundation models. Leveraging the pre-training capabilities of such foundational models can significantly improve the sample efficiency of Online RL and substantially shorten the training cycle. However, the academic community still faces challenges, as the overall technical roadmap for fine-tuning these foundation models through Online RL has not yet converged.
 
Zhang Xiaobai (Moderator): Looking Forward to Robot Adoption in Households, Calls for Breakthroughs in Perception Capabilities
 
From my perspective, I genuinely look forward to the realization of the goal for robots to "possess autonomous learning and adaptive capabilities." Once achieved, robots entering households will be able to deliver significantly more practical value.
 
Now let's focus on the next question: While previous discussions centered more on the general capabilities of robots, we now delve deeper into the specific field of perception. As Dr. Zhuang mentioned earlier in sharing research on parkour, using depth maps to enable robot parkour bears similarities to human perceptual logic. However, in most current application scenarios for humanoid robots, this is not the case — for instance, during robot sports events, most still rely on remote control, with few examples of actions completed through autonomous perception. The previously mentioned scenario of robots playing soccer also suffers from inefficient autonomous perception, resulting in suboptimal performance and a lack of "intelligence." This reflects the current reality.
 
Indeed, perception capability is precisely the key bottleneck for the commercialization and civilian adoption of robots: for robots to enter commercial settings or ordinary households, they must possess the function of "eyes" — the ability to autonomously perceive the world, identify problems, solve them, and even sense their own state and understand their own actions.
 
 
Zhang Xiaobai observed a practical issue: current humanoid robots still primarily rely on remote control in scenarios like sports events, demonstrating weak autonomous perception capabilities. To achieve commercialization, are robots' perception abilities sufficient? Where do the core bottlenecks lie?
 
Zhuang Ziwen: Perception is a module that must be viewed in a structured hierarchy alongside decision-making and control.
 
Regarding the discussion on robotic perception capabilities, I would like to first add a key perspective: in the pipeline from "perception to control," there exists a core intermediate step — "decision-making." The entire process should be broken down into three hierarchical layers: "perception-decision-control," rather than directly linking perception to control. Building on the earlier examples of robot sports events and soccer matches, the underperformance in these scenarios stems not from perception or control limitations, but from insufficient complexity in decision-making systems.

However, perception does not always require decision-making to influence control — as demonstrated in my team's two parkour-related papers. In such contexts, robots hardly need complex decision-making because the objectives are predefined. Here, perception serves to capture real-time environmental data directly informing control parameters — such as "whether to lift the leg higher or farther, or to speed up or slow down movements."

The fundamental reason why scenarios like sports events and soccer are perceived as "challenging" lies in the significant increase in problem complexity. For tasks such as locomotion or motion tracking, the objectives are clear and can be directly defined as precise mathematical problems. However, soccer requires deciding "whom to pass to and when to shoot," while robot marathons involve "how to plan race routes and respond to sudden terrain changes." These navigation problems rely on understanding dynamic scenes and multi-object relationships, necessitating more sophisticated decision-making systems to handle such variables.
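The three-layer split Dr. Zhuang describes can be summarized in a short schematic. The interfaces below are purely illustrative and only serve to show where decision-making sits between perception and control.

```python
# Schematic of the perception -> decision -> control split described above.
# All interfaces and values are illustrative; the point is the layering.
from dataclasses import dataclass

@dataclass
class EnvState:          # output of the perception layer
    obstacle_height: float
    obstacle_distance: float

@dataclass
class MotionGoal:        # output of the decision layer
    target_heading: float
    target_speed: float

class PerceptionLayer:
    def estimate(self, depth_image) -> EnvState:
        """Distill raw sensor data into just the environmental state the task needs."""
        return EnvState(obstacle_height=0.3, obstacle_distance=1.2)  # dummy values

class DecisionLayer:
    def plan(self, state: EnvState) -> MotionGoal:
        """Needed when the objective itself must be chosen (whom to pass to, which route)."""
        return MotionGoal(target_heading=0.0, target_speed=1.0)

class ControlLayer:
    def act(self, state: EnvState, goal: MotionGoal) -> list:
        """Turn state + goal into joint-level commands (lift the leg higher, speed up, ...)."""
        return [0.0] * 12  # placeholder joint targets

# In a parkour-style task the goal is fixed in advance, so perception can feed
# control almost directly; in soccer or a marathon, the DecisionLayer carries
# most of the added complexity.
```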
 
Zhang Xiaobai (Moderator): Do you consider the current state of perception capabilities to be fully sufficient?
 
Zhuang Ziwen: Perception requires systematic collaboration, with performance adapted to task complexity.
 
Robot perception depends on the comprehensive debugging or integrated development of the entire system. From a robotics perspective, isolating the perception system reveals it as a pipeline that processes raw sensor data to extract environmental states.
 
Based on my previous research in parkour-related studies, the perception module is one component of the robot's overall system. In task scenarios such as parkour, even if the perception module does not achieve optimal performance, it can still meet the overall operational requirements through collaborative adaptation with other parts of the system. However, as task complexity increases, the demands on the perception module correspondingly rise. The core requirement depends on the specific task's complexity and practical needs, necessitating tailored adaptation and optimization.
 
Hou Taixian: The Implicit Decision-Making Capability of Policies in Parkour Scenarios
 

Following the previous discussion, I would like to further explore a question: In scenarios like parkour, does the robot's policy inherently contain certain decision-making capabilities?

Taking the example of a robotic dog crossing boxes:

  • When the distance between two boxes is 60 cm, the robotic dog directly uses visual perception to assess the distance, processes environmental information through a GRU, and adopts a "direct jump" action strategy.

  • However, when the distance is adjusted to 70 cm, I observed a key phenomenon: the underlying policy chooses to descend from the first box onto a step, land with its hind legs, and then place its front legs on the second box.

This resembles an end-to-end problem, suggesting that the robot’s policy may have embedded certain decision-making capabilities directly into the control layer.
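A minimal sketch of the kind of recurrent, vision-conditioned policy described in this example might look as follows; the network sizes and inputs are illustrative assumptions, not the exact architecture Hou used.

```python
# Minimal sketch of a recurrent, vision-conditioned locomotion policy of the
# kind described above: depth features and proprioception pass through a GRU,
# whose hidden state lets the observed gap width implicitly change the behavior.
import torch
import torch.nn as nn

class ParkourPolicy(nn.Module):
    def __init__(self, depth_feat_dim=64, proprio_dim=48, hidden_dim=256, act_dim=12):
        super().__init__()
        self.depth_encoder = nn.Sequential(nn.Linear(depth_feat_dim, 128), nn.ELU())
        self.gru = nn.GRU(128 + proprio_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, act_dim)

    def forward(self, depth_feat, proprio, hidden=None):
        # depth_feat: [batch, T, depth_feat_dim], proprio: [batch, T, proprio_dim]
        x = torch.cat([self.depth_encoder(depth_feat), proprio], dim=-1)
        out, hidden = self.gru(x, hidden)       # memory of recent terrain observations
        return self.action_head(out), hidden    # joint targets per timestep

policy = ParkourPolicy()
actions, h = policy(torch.randn(1, 10, 64), torch.randn(1, 10, 48))
# The same network, trained end to end, can "jump directly" at 60 cm and
# "step down, then climb" at 70 cm: the mode switch is learned, not programmed.
```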

 
Zhuang Ziwen: Criteria for Delineating Control and Decision-Making
 
Regarding the definition of whether a robot's action belongs to control or decision-making, I believe one key factor lies in the difference in problem complexity. In earlier navigation-related robotics research, legged robots also required planning for leg execution, but such planning is generally not considered decision-making. For mobile robots, however, navigation planning based on pre-built maps is also a form of planning, yet it looks much more like decision-making. I use this example to illustrate, indirectly, that different task difficulties lead to different problems that neural networks need to solve.
 
Drawing an analogy to natural language processing, early efforts often focused on tasks like parsing syntax trees. When using smaller neural networks, they could adapt to such tasks as "syntax tree parsing" or "word-to-word translation." However, as the scale of neural networks expanded — somewhat akin to the scaling law — the problems they addressed grew increasingly complex, making them appear more like "decision-making" tasks.
 
Hou Taixian: Planning Strategies Under Increasing Terrain Complexity
 
Continuing the discussion on planning, I've noticed an interesting phenomenon: as the terrain complexity faced by robots continues to increase, many research efforts are now leaning toward "explicitly planning foot placement points." This approach actually bears some resemblance to early methods of robot motion training and, in a way, represents a kind of "return to fundamentals."
 
The Extreme Parkour paper, published around the same time as your work, adopted a similar logic: it first predicts the robot's target movement direction for the next instant, then feeds this predicted direction back into the policy as an observation, and finally combines it with environmental perception for further motion planning and control.
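The structure Hou describes (predict a target heading, then feed it back in as an observation for the controller) can be sketched roughly as below. This is a loose illustration under assumed dimensions, not the actual Extreme Parkour implementation.

```python
# Loose illustration of "predict the next movement direction, then feed it back
# in as an observation" (not the actual Extreme Parkour implementation).
import torch
import torch.nn as nn

class HeadingPredictor(nn.Module):
    """Predicts the target heading for the next instant from exteroception."""
    def __init__(self, depth_feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(depth_feat_dim, 128), nn.ELU(),
                                 nn.Linear(128, 2))          # (cos, sin) of heading

    def forward(self, depth_feat):
        return nn.functional.normalize(self.net(depth_feat), dim=-1)

class LocomotionPolicy(nn.Module):
    """Consumes proprioception + depth features + the predicted heading."""
    def __init__(self, proprio_dim=48, depth_feat_dim=64, act_dim=12):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(proprio_dim + depth_feat_dim + 2, 256),
                                 nn.ELU(), nn.Linear(256, act_dim))

    def forward(self, proprio, depth_feat, heading):
        return self.net(torch.cat([proprio, depth_feat, heading], dim=-1))

heading_predictor, policy = HeadingPredictor(), LocomotionPolicy()

def step(proprio, depth_feat):
    heading = heading_predictor(depth_feat)       # explicit, inspectable "plan"
    return policy(proprio, depth_feat, heading)   # control conditioned on that plan

actions = step(torch.randn(1, 48), torch.randn(1, 64))
```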
 
Zhuang Ziwen: The Overfitting Problem in Legged Robot Planning
 
In legged robot control systems, certain simple planning tasks can indeed be learned directly by neural networks. Taking the "robot jumping back and forth" scenario from previous parkour research as an example, this setting exhibits clear signs of overfitting—it cannot handle situations beyond the training scope, such as "continuing to jump left after jumping left," indicating insufficient generalization capability. Yet it is precisely this overfitting that makes the model appear to exhibit "decision-making" in specific scenarios; within those scenarios it is indeed solving a decision problem.
 
Hou Taixian: Returning to Perception Issues, Analyzing the Pain Points in SLAM Technology Implementation
 
Our discussion has somewhat diverged from the original focus on "perception issues," so let's return to this critical area. Recently, I have also been following SLAM-related technologies and found that the models developed in the current SLAM field are too large to deploy quickly. Their computational demands are extremely high — even with acceleration tools like TensorRT, they cannot run on edge platforms like the NX. Besides SLAM models, we therefore also opt for LiDAR point-cloud-based algorithms as perceptual inputs.
 
Zhang Xiaobai (Moderator): Current Status of Perception-Decision Solutions
 
As demonstrated by Dr. Zhuang Ziwen's parkour research, robots in highly dynamic motion scenarios can still effectively capture environmental information and adjust their movements through proprioceptive perception. This indicates the viability of such perceptual technological approaches. However, when decision-making becomes complex, existing methods struggle to adapt, shifting the core challenge to the decision-making domain. Currently, some academic approaches employ hierarchical network architectures, overlaying VLM or VLN modules on top of the perception layer to address navigation and reasoning decision requirements.
 
Just as Dr. Hou mentioned, current SLAM technology, including new neural-network models such as GP3R, doesn't seem particularly well suited to edge-computing robot platforms with limited computational power. On the other hand, traditional SLAM relies on ranging and odometry, and shows inadequate adaptability to complex dynamic scenarios, making it more suitable for specific functional applications. So, can we conclude that we haven't yet found a truly efficient perception + decision solution suitable for edge-computing mobile robots?
 
Ding Gang: Recognizes the Generalization Potential of VLM + Control Algorithms
 
It is currently difficult to determine which path will ultimately prove viable, but from a generalization perspective, I agree with Xiaobai's proposed approach of "VLM + control algorithms" — this combined model may possess stronger generalization potential.
 
Specifically, by leveraging the scene understanding and generalization capabilities of large models, these can be transferred to the robot's underlying motion abilities, such as parkour and various locomotion tasks. Simulating all complex real-world scenarios in a simulation environment is challenging. Attempting to encode all scenario information into a single model is not only impractical but might also require constructing extremely large models. However, large models, with their universal understanding of scenes and through massive annotated data and collected robot data, hold the potential to train models with strong navigation comprehension capabilities. Admittedly, this is difficult — the model hasn't yet converged — but this path possesses generalization potential and remains a viable direction.
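One way to picture the "VLM + control algorithms" combination Ding Gang endorses is a two-rate loop in which a vision-language model emits mid-level commands that a pre-trained low-level controller executes. The wrapper classes and command schema below are hypothetical placeholders, not a specific system.

```python
# Sketch of a "VLM + control" split: a vision-language model interprets the
# scene and emits a mid-level command, which a low-level locomotion policy
# executes. The wrapper and command schema are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class NavCommand:
    vx: float        # forward velocity (m/s)
    yaw_rate: float  # turning rate (rad/s)
    skill: str       # e.g. "walk", "step_over", "crouch"

class SceneVLM:
    """Placeholder wrapper around any vision-language model."""
    def query(self, image, instruction: str) -> NavCommand:
        # A real implementation would prompt the VLM with the image and
        # instruction, then parse its structured reply into a NavCommand.
        return NavCommand(vx=0.5, yaw_rate=0.0, skill="walk")

class LowLevelController:
    """Pre-trained locomotion policy; it understands NavCommand, not language."""
    def act(self, proprioception, command: NavCommand) -> list:
        return [0.0] * 12  # placeholder joint targets

vlm, controller = SceneVLM(), LowLevelController()

def control_loop(camera_image, proprioception, instruction="go to the table"):
    command = vlm.query(camera_image, instruction)   # slow, generalizes across scenes
    return controller.act(proprioception, command)   # fast, runs at control rate
```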
 
Zhang Xiaobai (Moderator): Proposal of Two Technical Pathways for Perception-Decision Integration
 
I believe one approach involves training large models for VLM through "massive computing power + big data." However, there should be another path. In nature, many animals and insects can complete perception and decision-making for survival without complex mechanisms, indicating the existence of a more streamlined implementation. Particularly, I've observed Ziwen's team's parkour research, where they achieved stable robot parkour using only a small end-to-end model — a pattern somewhat reminiscent of how organisms rely on simple mechanisms to respond to their environment.
 
Zhuang Ziwen: Practical Deployment Strategies for SLAM Technology
 
Regarding SLAM technology, I would like to add a perspective. While SLAM undoubtedly plays a crucial role in the current stage of robotics development, it may become less critical when robots truly enter the phase of physical deployment.
 
Just as you previously mentioned with insects — they possess the ability to return home, and humans exhibit similar behavior during locomotion. We focus on locally detailed environments without consciously noting, for example, millimeter-level precision of obstacle positions three meters beyond a door.
 
Therefore, during physical deployment, a robot's environmental perception and state modeling system should no longer pursue "full-detailed mapping" of the entire scene, but instead focus its core efforts on the "accurate capture of locally critical information."
 
 
Ding Gang: Proposed that "the cerebrum can be generalized across platforms, but the cerebellum cannot."
 

This topic greatly interests me, as cross-embodiment research is primarily conducted by research institutes and universities. During my time at BAAI, all embodied intelligence departments were focused on cross-embodiment. My view on this is—the "cerebrum" can be transferred across embodiments, but the "cerebellum" cannot.

Here's a simple example: if your consciousness were transferred to a body with a significantly different physique, I believe your knowledge could be carried over. If you think you could immediately control that body to walk, then you believe cross-embodiment is feasible. But if you feel it would take time to adapt, then the conclusion is that it cannot be fully transferred.

Thus, our technical approach centered on developing a general "cerebrum" while creating separate control algorithms for different embodiments. As long as knowledge can be transmitted through the "cerebrum," we can indirectly achieve cross-embodiment capability. That is my perspective.

 
Zhuang Ziwen: Supplementary Conditions for Cross-Embodiment Training Framework Adaptation
 
I actually agree with this perspective. The "cerebrum" can certainly be transferred across embodiments. As for the "cerebellum," while the trained neural network itself cannot be directly transferred, the training framework can be applied across embodiments. If we're dealing with humanoid robots, even some highly advanced traditional motion control systems can achieve cross-embodiment compatibility. However, if we're trying to extend this to quadrupedal robots, that likely wouldn't be feasible.
 
Zhang Xiaobai (Moderator):
 
So the premise is morphological similarity. From the perspective of sharing the same training framework, cross-embodiment is certainly achievable.
 
Hou Taixian: Observations and Limitations in Cross-Embodiment Research
 
I've previously studied this area quite extensively. When we discuss cerebellum cross-embodiment, one particularly interesting work is Manyquadruped. That study essentially pre-planned the entire trajectory, then adapted it to new robotic dogs through inverse kinematics. This is one approach, though its limitations are quite apparent — the trajectory becomes relatively constrained, requiring rigid execution without the adaptability seen in parkour-like scenarios.
 
Therefore, while cross-embodiment capability can be achieved, it inevitably comes at the cost of sacrificing certain aspects — this holds true for quadrupedal cross-embodiment. Regarding cross-embodiment between quadrupeds and bipeds, I previously came across a project that essentially involved encoding from one human to a dog, then decoding from the dog to another human. The conclusion was that it's likely feasible, but significant compromises are unavoidable.
 
Therefore, I believe it is feasible, but if you seek high performance, the cost outweighs the benefits — significant sacrifices are inevitable. It's more efficient to train robots in thousands of simulation environments, where tasks that can be mastered in minutes don't require such complex structural designs. Alternatively, using large transformers can also achieve cross-embodiment, but overall, this technical pathway doesn't appear particularly elegant.
 
Su Zhi: Introduction to Technical Feasibility Research on Cross-Embodiment
On this point, I've noticed that Professor Hao Su's research group had a CoRL paper this year on cross-embodiment. Their approach involved working with robots of different morphologies (including quadrupeds, hexapods, and humanoids), where each robot's knees, joints, and body parts had different parameters. They first trained a teacher model on each robot type, then distilled all knowledge into a very large network, ultimately achieving cross-embodiment capabilities.
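The teacher-student recipe Su Zhi summarizes, with per-morphology teachers distilled into one large shared network, can be sketched roughly as below; the padding scheme, loss, and sizes are illustrative assumptions rather than the cited paper's exact method.

```python
# Minimal sketch of distilling several morphology-specific teacher policies
# into one shared student network (loss and I/O handling are illustrative,
# not the cited paper's exact method).
import torch
import torch.nn as nn

MAX_OBS, MAX_ACT = 128, 32   # padded sizes covering every morphology

class SharedStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(MAX_OBS, 1024), nn.ELU(),
                                 nn.Linear(1024, 1024), nn.ELU(),
                                 nn.Linear(1024, MAX_ACT))

    def forward(self, obs_padded):
        return self.net(obs_padded)

student = SharedStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_batch(teacher_policies, batches):
    """batches[name] = (obs_padded, act_mask) sampled from that robot's rollouts."""
    loss = 0.0
    for name, teacher in teacher_policies.items():
        obs_padded, act_mask = batches[name]
        with torch.no_grad():
            target = teacher(obs_padded)               # per-morphology expert action
        pred = student(obs_padded)
        loss = loss + ((pred - target) ** 2 * act_mask).mean()  # ignore unused joints
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```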
 
Let's set aside the question of necessity for now—technically, this is feasible. The "consciousness transfer" example mentioned earlier isn't entirely appropriate, because your cerebellum is overfitted to your own body, trained exclusively on your own physical data. However, with a more general controller, it can be fine-tuned on different embodiments to eventually achieve cross-embodiment control. This discussion temporarily bypasses the question of necessity and focuses on the technical possibility.
 
Zhang Xiaobai (Moderator): Extending Cross-Embodiment to Morphology Expansion Scenarios
 
Let's extend this question further: such "cross-embodiment" can also be seen as an expansion of the original body. For instance, wearing different shoes or using tools to extend one's hands — could this also be considered cross-embodiment or body expansion, where we use a new embodiment to learn new tasks?
 
Su Zhi: Using Knee Injury as an Example to Emphasize the Importance of Cross-Embodiment Adaptation
 
That's an excellent point. I've previously considered similar scenarios — for example, when a person suddenly suffers a knee injury, this can also be viewed as a form of cross-embodiment, because the parameters of your knee change significantly. Yet humans only require a short period to adapt. I believe this adaptive capability is particularly important.
 
Hou Taixian: Sharing multi-scale architecture application ideas in cross-embodiment
 
You've immediately touched on my first research work. This is essentially a multi-scale problem. The initial design was as follows: we first trained a hierarchical network where the upper layer was shared, and the lower layers were separated to handle different scenarios — such as intact legs or various types of leg impairments. An MLP was then used to connect the upper and lower layers, forming a large integrated network. Depending on the task, the corresponding "sub-module" would be activated for output, enabling multi-skill learning through this approach.
 
I believe this approach can also be applied to cross-embodiment scenarios. As previously mentioned with Transformers, humanoid robots have many degrees of freedom while quadruped robots have fewer — their observational data differs significantly, and simply padding with zeros isn't viable. Here, the "shared lower layers + branched upper layers" architecture becomes valuable: we can build a large model that incorporates multiple "cerebellum" modules, ultimately enabling control over robots with different morphologies.
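A minimal sketch of such a shared-trunk, branched architecture is shown below; which layers are shared versus branched is a design choice, and all names and dimensions here are illustrative assumptions.

```python
# Minimal sketch of a "shared trunk + per-embodiment branches" policy: one part
# of the network is reused across scenarios/morphologies, while separate
# sub-modules specialize. Dimensions are illustrative only.
import torch
import torch.nn as nn

class BranchedPolicy(nn.Module):
    def __init__(self, obs_dims: dict, act_dims: dict, latent_dim=256):
        super().__init__()
        # Per-embodiment encoders map differently sized observations to a common latent.
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, latent_dim), nn.ELU())
            for name, d in obs_dims.items()
        })
        # Shared trunk carries the reusable "knowledge".
        self.trunk = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ELU(),
                                   nn.Linear(latent_dim, latent_dim), nn.ELU())
        # Per-embodiment action heads ("cerebellum" branches).
        self.heads = nn.ModuleDict({
            name: nn.Linear(latent_dim, d) for name, d in act_dims.items()
        })

    def forward(self, name: str, obs: torch.Tensor) -> torch.Tensor:
        return self.heads[name](self.trunk(self.encoders[name](obs)))

policy = BranchedPolicy(
    obs_dims={"humanoid": 69, "quadruped": 45},   # different DoF -> different obs sizes
    act_dims={"humanoid": 23, "quadruped": 12},
)
action = policy("quadruped", torch.randn(1, 45))  # activate the matching branch
```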
 
 

This roundtable discussion centered on "Future Directions and Technical Pathways for Humanoid Robot Motion and Perception," featuring five participants who delved deeply into aspects ranging from technology and scenarios to development roadmaps. The exchanges included detailed analysis of core technologies like hardware iteration and algorithmic breakthroughs, predictions on the practical application timeline for deploying solutions in simple scenarios and achieving generalization capabilities, and clarification of industry consensus along with unresolved challenges in key areas such as perception-decision coordination and cross-embodiment adaptation. These insights are provided for reference by professionals both inside and outside the industry.

Moving forward, HighTorque Robotics will host more exchange events for academics and developers. Beyond supplying reliable hardware, we are committed to building a vibrant ecosystem platform for academic and development collaboration. Scholars and developers interested in cooperation and exchange are welcome to contact us via WeChat: dionysuslearning.
