Building on previous discussions, this article further explores the core characteristics and developmental challenges of Visual-Inertial Odometry (VIO) systems.
An ideal VIO system must meet multiple stringent requirements, which can be summarized as follows:
- It must deliver high-precision pose estimation (as demonstrated by systems like ORB-SLAM3 and VINS-Mono);
- It must support semi-dense mapping functionality (following technical paths similar to DSO and DM-VIO);
- It must be practically deployable: able to handle a variety of extreme scenarios and to integrate key techniques such as ZUPT (Zero Velocity Update) (following the design philosophy of PR-MONO1);
- It must have low computational overhead. A system that can only run on high-performance computing platforms has limited commercial value (except in the autonomous driving field), and it also contradicts the fundamental design principles of VINS/VIO.



Three of these requirements make up what this article calls the “three-phase nature” of VIO systems: overhead, robustness, and accuracy. These three characteristics are the core design criteria and performance indicators of a VIO system, and they correspond to the 4th, 3rd, and 1st requirements above, respectively. The main technical means of achieving them include parallel processing and chip-level optimization (for overhead) and the integration of ZUPT algorithms (for robustness). Accuracy optimization covers a broader technical scope and is not elaborated here.
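To make the ZUPT idea concrete, here is a minimal sketch of one common way to detect a stationary interval from raw IMU samples and then pull the velocity estimate toward zero. It is not the method of any system named above. The function names, window format, and thresholds (`accel_var_thresh`, `gyro_mean_thresh`, the measurement noise `r`) are hypothetical values chosen for illustration, and the scalar Kalman-style update stands in for the full pseudo-measurement a real filter or optimizer would use.

```python
import math
from statistics import fmean, pvariance


def is_stationary(accel_window, gyro_window,
                  accel_var_thresh=0.05, gyro_mean_thresh=0.02):
    """Heuristic zero-velocity detector over a short window of IMU samples.

    accel_window: list of (ax, ay, az) in m/s^2
    gyro_window:  list of (gx, gy, gz) in rad/s
    Thresholds are illustrative assumptions, not tuned values.
    """
    accel_norms = [math.sqrt(ax * ax + ay * ay + az * az)
                   for ax, ay, az in accel_window]
    gyro_norms = [math.sqrt(gx * gx + gy * gy + gz * gz)
                  for gx, gy, gz in gyro_window]
    # At rest the accelerometer magnitude barely varies (it only sees gravity)
    # and the gyroscope magnitude stays near zero.
    return (pvariance(accel_norms) < accel_var_thresh
            and fmean(gyro_norms) < gyro_mean_thresh)


def zupt_update(v, p, r=1e-4):
    """Scalar Kalman update with the pseudo-measurement 'velocity = 0'.

    v: current velocity estimate on one axis (m/s)
    p: variance of that estimate
    r: assumed variance of the zero-velocity pseudo-measurement
    Returns the corrected (velocity, variance) pair.
    """
    k = p / (p + r)          # Kalman gain
    v_new = v + k * (0.0 - v)
    p_new = (1.0 - k) * p
    return v_new, p_new
```

In use, `is_stationary` would be evaluated on a sliding window, and `zupt_update` applied per velocity axis only while it returns `True`, so that drift accumulated during integration is corrected whenever the device is known to be at rest.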
In engineering practice, the order in which the three-phase nature is implemented often runs counter to the path taken in academic research, as detailed in previous articles. After multiple iterations, the current software and hardware solutions have largely addressed the three-phase nature and have been adapted to scenarios such as extreme shock, so the project is entering a critical stage of practical deployment. At this point, however, the core issue facing the team has shifted to the construction of a “worldview”, which corresponds to the semi-dense mapping capability in the 2nd requirement above.
Interestingly, resolving the three-phase nature left the team, and its members individually, with mixed feelings. Logically, breaking through this core bottleneck should be exhilarating, but in reality it is not. The reasons can be summarized as follows:
- For VIO/VSLAM or binocular VSLAM systems, the greatest challenge is not constructing the “worldview” itself, but maintaining the three-phase nature while constructing it. With existing technologies, the framework of the “worldview” has in fact long been established. A “worldview” that is incompatible with the three-phase nature, however, will remain a theoretical research outcome that is difficult to deploy in practice.
- If an independent depth camera (D-camera) is used to construct environmental cognition (i.e., the “worldview”), then even setting aside the enormous computational overhead of the Iterative Closest Point (ICP) algorithm (which is difficult to reduce significantly even with efficient data structures such as KD-trees and octrees), merely meeting “overhead”, the most critical requirement of the three-phase nature, demands heavy reliance on NVIDIA’s CUDA ecosystem and Kepler architecture. A more prominent issue is that depth cameras generally have a limited detection range. Their performance barely meets basic needs such as obstacle avoidance, which makes it difficult to build a “worldview” of large-scale environments.
- Therefore, the only feasible path left is to construct the “worldview” through pure visual odometry (VO) and visual Bundle Adjustment (BA) while keeping the three-phase nature in balance. A deeper anxiety is that even if the pure visual scheme can construct a “worldview” and run stably over long distances, it still has insurmountable inherent limitations, compared with lidar and depth cameras, in distinguishing “solid/hollow” structures in textured areas. This is a significant shortcoming of the pure visual technical route.
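The ICP cost mentioned in the list above comes mainly from the correspondence search that every iteration repeats. The following is a minimal sketch of point-to-point ICP, reduced to 2D and using a brute-force nearest-neighbor search, purely to show the structure of the loop. It is an illustrative assumption, not the pipeline of any depth-camera system discussed here; all function names are hypothetical, and a real implementation would work on 3D point clouds with a spatial index and outlier rejection.

```python
import math


def nearest_neighbors(source, target):
    """Brute-force correspondence search.

    Costs len(source) * len(target) distance evaluations per call, which is
    the step KD-trees or octrees are normally used to accelerate.
    """
    pairs = []
    for sx, sy in source:
        best = min(target, key=lambda t: (t[0] - sx) ** 2 + (t[1] - sy) ** 2)
        pairs.append(((sx, sy), best))
    return pairs


def best_rigid_2d(pairs):
    """Closed-form least-squares rotation and translation for matched 2D points."""
    n = len(pairs)
    csx = sum(s[0] for s, _ in pairs) / n
    csy = sum(s[1] for s, _ in pairs) / n
    ctx = sum(t[0] for _, t in pairs) / n
    cty = sum(t[1] for _, t in pairs) / n
    cos_sum = 0.0
    sin_sum = 0.0
    for (sx, sy), (tx, ty) in pairs:
        ax, ay = sx - csx, sy - csy      # centered source point
        bx, by = tx - ctx, ty - cty      # centered target point
        cos_sum += ax * bx + ay * by
        sin_sum += ax * by - ay * bx
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    return theta, ctx - (c * csx - s * csy), cty - (s * csx + c * csy)


def icp_2d(source, target, iterations=10):
    """Alternate correspondence search and rigid alignment a fixed number of times."""
    current = list(source)
    for _ in range(iterations):
        pairs = nearest_neighbors(current, target)
        theta, tx, ty = best_rigid_2d(pairs)
        c, s = math.cos(theta), math.sin(theta)
        current = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in current]
    return current
```

With N source points, M target points, and K iterations, this version performs N × M × K distance evaluations. A spatial index lowers the cost of each lookup, but the search still has to be repeated for every point in every iteration and for every incoming depth frame.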
Despite these challenges, the essence of technological exploration lies in continuously breaking through bottlenecks. With further in-depth research, solutions should eventually emerge.