How to Engineer and Deploy Mainstream VIO/VSLAM Systems
This article focuses on strategies for turning mainstream VIO (Visual-Inertial Odometry) and VSLAM (Visual Simultaneous Localization and Mapping) systems, such as VINS-Mono and DSO, into deployable engineering products. The content aims for conciseness, but fully absorbing the core methods requires interdisciplinary grounding in software, hardware, and algorithms.
Core Engineering Issues in Mainstream VIO Systems
1. ZUPT (Zero-Velocity Update) and Failure in Special Scenarios
Hardware selection is critical for engineering deployment:
- For indoor environments, use a global-shutter camera and synchronize the IMU and camera with a hardware MCU, driving the camera-IMU time offset Td toward zero (see the sketch after this list);
- For outdoor use with rolling-shutter cameras, exposure time must be precisely controlled to reach commercial-grade loop-closure performance.
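When hardware synchronization is not available, estimators such as VINS-Mono can calibrate the residual offset td online and use it to shift camera timestamps into the IMU clock domain. Below is a minimal sketch of that correction; the names are hypothetical and the convention t_imu = t_cam + td is the one used by VINS-Mono's temporal calibration.

```cpp
// Minimal sketch (names hypothetical) of camera-IMU time alignment.
// With hardware-MCU triggering, td is driven toward zero and this
// correction becomes a no-op; otherwise an estimator such as VINS-Mono
// can calibrate td online.

struct ImuSample {
    double t;        // IMU timestamp, seconds
    double gyro[3];  // angular rate, rad/s
};

// Convention used by VINS-Mono's temporal calibration: t_imu = t_cam + td.
double to_imu_time(double t_cam, double td) { return t_cam + td; }

// Linearly interpolate the gyro rate at the corrected frame time t,
// given two bracketing IMU samples a and b (assumes a.t < t <= b.t).
void gyro_at(const ImuSample& a, const ImuSample& b, double t, double out[3]) {
    const double w = (t - a.t) / (b.t - a.t);
    for (int i = 0; i < 3; ++i)
        out[i] = (1.0 - w) * a.gyro[i] + w * b.gyro[i];
}
```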
Research teams have run a series of engineering validations on mainstream devices such as the RealSense, Kinect V2, and Xiaomi MiVision. Even after thorough software optimization, performance remained subpar even on high-end platforms such as a 12th-generation Intel Core i7 (the Kinect V2 fared relatively better). Ultimately, self-developed camera hardware was required to fully remove the hardware-level bottlenecks.
2. Optimizing the Engineering Overhead of the Main System
As classic academic systems, VINS-Mono and DSO, along with their derivatives (e.g., VINS-Fusion, VI-DSO), have received continuous algorithmic refinement. However, beyond the zero-velocity-update issues above, they still carry heavy overhead in engineering scenarios. Two areas in particular need breakthroughs:
(1) Lightweight Design of Front-End Video Stream Display and Interaction Guidance Modules
The core function of this module is to provide clear video feedback through a human-machine interface, so users can intuitively see the system's initialization status and feature-point tracking results. Yet its optimization is often overlooked.
Laboratory versions (e.g., VINS, DSO) rely heavily on tools like Pangolin and RViz, operating directly on raw image data and using OpenCV on the CPU to overlay and display feature points. This design has significant flaws: processing raw frames consumes substantial compute, and per-point CPU drawing calls add further overhead, making the system unfit for embedded platforms. Even when it does run, the latency of delivering the visual stream to the user grows drastically.
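To make the overhead concrete, the sketch below shows the CPU-bound pattern these lab pipelines follow (a simplified illustration, not their literal code): every frame is copied in full and every tracked feature is rasterized by the CPU before the raw image is pushed to a display window.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Simplified illustration of the CPU-side overlay pattern: each frame is
// copied in full, every feature is drawn pixel-by-pixel on the CPU, and
// the uncompressed result goes to the display. Fine for a debug window
// on a PC; on an embedded target it burns memory bandwidth and CPU time
// that the front end needs.
void show_tracking(const cv::Mat& gray_frame,
                   const std::vector<cv::Point2f>& features) {
    cv::Mat vis;
    cv::cvtColor(gray_frame, vis, cv::COLOR_GRAY2BGR);    // full-frame copy
    for (const auto& p : features)
        cv::circle(vis, p, 2, cv::Scalar(0, 255, 0), -1); // per-point CPU draw
    cv::imshow("tracking", vis);  // raw, uncompressed display path
    cv::waitKey(1);
}
```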
Solution: Encode the video stream with the hardware codec engine of a high-performance vision SoC, and render feature-point overlays and status displays with the hardware OSD blocks. This resolves the long-latency, high-overhead video path on both embedded systems and PCs. The technique is mature in machine vision and supported by mainstream SoCs (e.g., HiSilicon Hi3519A), with diverse implementation routes, though it requires an R&D team with strong PCBA design and embedded development capabilities.
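The hardware path inverts the flow shown above: frames go from the sensor straight into the SoC's codec engine, and the overlay is expressed as a small set of OSD regions that dedicated hardware blends during encode. The sketch below is purely structural; the osd_* and venc_* names are hypothetical stand-ins for a vendor SDK (HiSilicon's MPP, for example, exposes region and video-encode modules for this purpose), not real calls.

```cpp
// Hypothetical sketch of the hardware OSD + codec path. The osd_* and
// venc_* calls stand in for the SoC vendor's SDK; names and signatures
// are illustrative only.
//
// Key point: the CPU never touches frame pixels. It only updates a few
// small overlay regions; blending and encoding happen in hardware, so
// latency and CPU load stay flat as resolution grows.

struct OsdRegion;  // opaque handle to a hardware overlay region

OsdRegion* osd_create_point_layer(int max_points);                 // hypothetical
void       osd_update_points(OsdRegion*, const float* xy, int n);  // hypothetical
void       venc_bind_overlay(int venc_channel, OsdRegion*);        // hypothetical

void setup_feature_overlay(int venc_channel) {
    OsdRegion* layer = osd_create_point_layer(/*max_points=*/256);
    venc_bind_overlay(venc_channel, layer);  // blended in hardware during encode
}

// Called once per frame by the tracker: only coordinates cross over to
// the overlay hardware, never pixels.
void on_tracked_features(OsdRegion* layer, const float* xy, int n) {
    osd_update_points(layer, xy, n);
}
```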
(2) Engineering Implementation of 3D Point Cloud and Pose Display Modules
The core data output by VIO/VSLAM systems comes in two types: 3D spatial coordinates (e.g., inverse depth, homogeneous coordinates) and camera poses. In large-scale systems, this data is typically wrapped in SDKs and data structures, but user-side visualization still requires solving the 3D point-cloud display problem. While VINS and DSO can extend their display via RViz or Pangolin, that approach is not recommended for engineering use.
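As a concrete illustration (the layout is hypothetical, not any specific vendor's API), the per-frame SDK output usually reduces to a pose plus a batch of sparse map points:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the per-frame structures such an SDK might
// expose. Field names and layout are illustrative only.

struct Pose {
    double t;     // timestamp, seconds (IMU clock)
    double q[4];  // orientation, unit quaternion (w, x, y, z)
    double p[3];  // position in the world frame, meters
};

struct LandmarkPoint {
    uint32_t id;      // stable feature id across frames
    float    xyz[3];  // 3D position in the world frame, meters
                      // (converted from the estimator's inverse-depth
                      //  parameterization at output time)
};

struct FrameOutput {
    Pose pose;
    std::vector<LandmarkPoint> points;  // sparse map points seen this frame
};
```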
Solutions:
- For embedded development, NVIDIA solutions simplify implementation thanks to their robust ecosystem; Mali-GPU-based solutions require proficiency in OpenCL.
- A better approach is to extend development on the PC side: first complete structured processing of the point-cloud and pose data with SDK encapsulation, then choose a suitable development route:
Option 1: C++ development based on the PCL point-cloud library and the Qt framework.
Option 2: Browser-side development.
Initially, basic interactions like rotation, translation, and scaling are sufficient; a minimal sketch follows.
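For Option 1, PCL's bundled visualizer already provides mouse-driven rotation, translation, and zoom, so a first iteration can be very small. Here is a minimal sketch with placeholder data; in practice the cloud would be filled from the SDK's per-frame output.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/visualization/pcl_visualizer.h>

// Minimal PC-side viewer sketch for Option 1 (PCL; a Qt widget can wrap
// this later). PCLVisualizer provides mouse rotation, translation, and
// zoom out of the box, covering the basic interaction needs.
int main() {
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    // Placeholder surface; replace with points from the SDK output.
    for (float x = -1.f; x <= 1.f; x += 0.1f)
        for (float y = -1.f; y <= 1.f; y += 0.1f)
            cloud->push_back(pcl::PointXYZ(x, y, 0.5f * x * y));

    pcl::visualization::PCLVisualizer viewer("VIO map");
    viewer.addPointCloud<pcl::PointXYZ>(cloud, "map");
    viewer.addCoordinateSystem(0.5);  // draw the current pose frame
    while (!viewer.wasStopped())
        viewer.spinOnce(16);          // keep the UI responsive
    return 0;
}
```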
These optimization tasks, while labor-intensive and demanding strong R&D capabilities, are essential for engineering deployment. Without them, VIO/VSLAM systems remain confined to the laboratory and fail the practical needs of industry and end users: over 95% of users will not spare extra main-controller compute for non-core functions.
Note: Core computing resources should prioritize the system’s core modules. In VIO and multi-sensor fusion systems, core modules include only front-end feature processing and back-end state estimation; loop closure detection is merely a "semi-core" function.
RoboBaton-VIOBOT2 provides pure-vision spatial perception cameras designed specifically for robot vision to enhance a robot's environmental awareness. Our cameras deliver real-time spatial perception data, including depth maps, position, and posture, helping robots achieve more efficient spatial localization, object recognition, path planning, dynamic scene understanding, and obstacle avoidance. They are a core hardware component for boosting robot vision performance.