NALO-VOM: Navigation-Oriented LiDAR-Guided Monocular Visual Odometry and Mapping


📌 Overview

Traditional monocular visual odometry (VO) often struggles with sparse environment maps that lack the structural detail necessary for autonomous vehicle navigation. To address this, we propose NALO-VOM, a system that transfers the geometric structural knowledge of 3D LiDAR to a monocular VO framework via an offline-trained major-plane prediction network. By utilizing non-artificial projection labels during training, the system enables a single camera to predict major-plane masks (MP-Masks) in real time. This integration allows for scale-consistent camera pose estimation and the reconstruction of high-quality semi-dense maps that are dense enough to be transformed into 2D grid maps for motion planning. Extensive experiments on the KITTI dataset and real-world platforms demonstrate that NALO-VOM achieves superior localization accuracy and provides a reliable mapping solution for obstacle avoidance and decision-making in UGV navigation.

🚀 Key Contributions

  • LiDAR-to-Camera Knowledge Transfer: We pioneered a method to transfer the structural representation ability of 3D LiDAR to a monocular VO system using a major-plane prediction network.
  • Major-Plane Mask (MP-Mask) Integration: The network predicts an MP-Mask for each image frame to assist in dense front-end tracking and ground plane extraction.
  • Scale Optimization: By leveraging the extracted ground plane, the system significantly constrains scale drift, a common issue in long-term monocular VO.
  • Semi-Dense Mapping: Major planes are incorporated to build a high-quality map capable of supporting UGV motion planning and decision-making.

🛠️ System Architecture & Methodology

(Caption: The system architecture of NALO-VOM.)

Our framework consists of two main phases: an offline LiDAR-guided training process and an online monocular odometry process.

1. Offline Major-Plane Prediction Network

  • We utilize a ResNet-50-based architecture to predict major planes (e.g., ground, walls) that possess uniform planar curvature and occupy large areas.
  • The network is trained offline using non-artificial projection labels generated by perfectly synchronized camera-LiDAR pairs.
  • The training loss \(E(g)\) is formulated as a sum of the variance and a weighted squared mean of the error in log space: \(E(g)=\frac{1}{c}\sum_{i=0}^{h}\sum_{j=0}^{w}g_{(i,j)}^{2}-\frac{\lambda}{c^{2}}(\sum_{i=0}^{h}\sum_{j=0}^{w}g_{(i,j)})^{2}\)
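The loss above can be checked numerically. The following is a minimal sketch, not the training code: it assumes that \(g\) is the per-pixel error in log space, that \(c\) is the number of pixels summed over, and that \(\lambda\) is a user-chosen weight (the names `major_plane_loss` and `variance_plus_weighted_mean` are hypothetical). It also shows why the expression can be read as a variance plus a weighted squared mean: the weight on the squared mean is \(1-\lambda\).

```python
def major_plane_loss(g, lam):
    """E(g) = (1/c) * sum(g^2) - (lam / c^2) * (sum(g))^2 over all pixels."""
    flat = [v for row in g for v in row]  # g: 2-D list of log-space errors
    c = len(flat)                         # assumption: c = number of pixels
    sum_sq = sum(v * v for v in flat)
    total = sum(flat)
    return sum_sq / c - lam * total * total / (c * c)


def variance_plus_weighted_mean(g, lam):
    """Equivalent form: Var(g) + (1 - lam) * mean(g)^2."""
    flat = [v for row in g for v in row]
    c = len(flat)
    mean = sum(flat) / c
    var = sum((v - mean) ** 2 for v in flat) / c
    return var + (1.0 - lam) * mean * mean


# Toy 2x2 error map (made-up values, for illustration only).
g = [[0.1, -0.2], [0.3, 0.4]]
loss = major_plane_loss(g, lam=0.5)
```

With \(\lambda = 1\) the loss reduces to the pure variance, so a constant offset in log space is not penalized; with \(\lambda = 0\) it is the plain mean squared error.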

2. Online Dense Tracking & Scale Optimization

  • Dense Tracking: During the online phase, the system only takes RGB images as inputs. The network generates an MP-Mask, which provides additional depth pixels for robust front-end tracking. For a point on a fitted plane parameterized by \(\pi=[\vec{n}^{T},\sigma]^{T}\), the newly added depth \(d_{p}^{*}\) is given by: \(d_{p}^{*}=\vec{n}^{T}\cdot K^{-1}\cdot p\cdot\sigma^{-1}\)

  • Photometric Optimization: The system minimizes the photometric error across a sliding window of length \(W\). The loss function is defined as: \(\mathcal{L} = \text{arg min}_{\xi_r, \xi_t, \Theta_{id}} \sum_{r=1}^W \sum_{t=1}^W \sum_{p \in N_r} \omega_p ||I_t(p') - I_r(p)||_{\gamma} \quad (3)\) where the weighting factor \(\omega_p\) handles gradient consistency: \(\omega_p = \frac{m^2}{m^2 + ||\nabla I_r(p)||_2^2} \quad (4)\)

  • Projection Model: The projected point position \(p'\) is derived via the transformation matrix: \(P_t = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = R(d_p^{-1} K^{-1} p) + t \quad (5)\) \(T_{tr} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \quad (6)\)

  • Depth Estimation & Uncertainty: The inverse depth \(d_p\) is calculated using the matching point \(u_m\) on the epipolar line: \(d_p = \frac{x_{pc} - u_m z_{pc}}{u_m z_t - x_t} \quad (7)\) To account for noise, we define the search range \((d_p^{min}, d_p^{max})\) using the uncertainty \(\sigma_{\lambda}\): \(d_p^{min} = \frac{x_{pc} - (u_m + \sigma_{\lambda}) z_{pc}}{(u_m + \sigma_{\lambda}) z_t - x_t}, \quad d_p^{max} = \frac{x_{pc} - (u_m - \sigma_{\lambda}) z_{pc}}{(u_m - \sigma_{\lambda}) z_t - x_t} \quad (8, 9)\)

  • Scale Recovery: Assuming a constant vertical distance \(h_g\) between the camera and the local ground, we impose a scale constraint \(\rho_c = h_g / h_c\) to update the translation vectors and perform sliding window optimization.
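
The formulas in this list can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the NALO-VOM implementation: the camera is taken to be a skew-free pinhole with hypothetical intrinsics `fx, fy, cx, cy`; \((x_{pc}, z_{pc})\) are read as components of \(R K^{-1} p\) and \((x_t, z_t)\) as components of \(t\); \(u_m\) is read as a normalized image coordinate; and \(h_c\) is read as the camera height estimated from the fitted ground plane. All function names are hypothetical. The plane term is evaluated exactly as written; how it is interpreted depends on the plane sign convention, which the text does not spell out.

```python
def backproject(p, fx, fy, cx, cy):
    """K^{-1} p for an assumed skew-free pinhole camera; p = (u, v) in pixels."""
    u, v = p
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def plane_term(n, sigma, ray):
    """d_p^* = n^T (K^{-1} p) / sigma, evaluated as written; ray is K^{-1} p."""
    return sum(n_i * r_i for n_i, r_i in zip(n, ray)) / sigma


def gradient_weight(grad, m):
    """Eq. (4): the weight shrinks as the squared image gradient grows."""
    gx, gy = grad
    return m * m / (m * m + gx * gx + gy * gy)


def project_u(d_p, ray_rot, t):
    """Eq. (5) followed by x / z; ray_rot stands for R K^{-1} p (assumed)."""
    x = ray_rot[0] / d_p + t[0]
    z = ray_rot[2] / d_p + t[2]
    return x / z


def inverse_depth(u_m, ray_rot, t):
    """Eq. (7) with (x_pc, z_pc) from ray_rot and (x_t, z_t) from t."""
    x_pc, _, z_pc = ray_rot
    x_t, _, z_t = t
    return (x_pc - u_m * z_pc) / (u_m * z_t - x_t)


def inverse_depth_range(u_m, sigma_l, ray_rot, t):
    """Eqs. (8, 9). Sorting is an addition of this sketch: which bound is
    numerically larger depends on the direction of camera motion."""
    a = inverse_depth(u_m + sigma_l, ray_rot, t)
    b = inverse_depth(u_m - sigma_l, ray_rot, t)
    return min(a, b), max(a, b)


def rescale_translation(t, h_g, h_c):
    """Scale recovery: multiply t by rho_c = h_g / h_c."""
    rho_c = h_g / h_c
    return tuple(rho_c * t_i for t_i in t)


# Round trip with made-up values: project with Eq. (5), recover with Eq. (7).
ray_rot = (0.1, -0.05, 1.0)
t = (0.5, 0.0, 0.02)
u_m = project_u(0.25, ray_rot, t)
recovered = inverse_depth(u_m, ray_rot, t)
d_lo, d_hi = inverse_depth_range(u_m, 0.01, ray_rot, t)
```

The round trip is the useful check here: under the reading above, Eq. (7) is the algebraic inverse of Eq. (5), so `recovered` matches the inverse depth that was projected, and the search interval from Eqs. (8, 9) brackets it.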

3. Semi-Dense Map Reconstruction

  • To reconstruct texture-less areas without relying on computationally heavy pixel-wise depth networks, we introduce major-planes into the map building process.
  • The major-planes are aligned to the nearest planes in the sparse point cloud map using the following loss function: \(\mathcal{L}_{k}=\sum_{p\in{}^{n}\pi_{k}}\frac{\|({}^{n}\pi_{k}^{DSO})^{T}\cdot p\|_{2}}{{}^{n}N_{k}}\)
  • This combination of sparse feature points and major-planes yields a semi-dense environment map that accurately represents the environment’s geometry.
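
One possible reading of the alignment loss is a mean absolute point-to-plane residual. The sketch below assumes that each point \(p\) is used in homogeneous form \((x, y, z, 1)\), that the plane is a 4-vector \((\vec{n}^{T}, \sigma)\), and that the normalizer is the number of points; the function name `plane_alignment_loss` is hypothetical and this is not the authors' code.

```python
def plane_alignment_loss(points, plane):
    """Mean of |pi^T p| over the points, with p = (x, y, z, 1)."""
    a, b, c, sigma = plane  # plane = (n_x, n_y, n_z, sigma)
    residuals = [abs(a * x + b * y + c * z + sigma) for x, y, z in points]
    return sum(residuals) / len(residuals)


# Made-up example: the plane z = 2 and three points near it.
plane = (0.0, 0.0, 1.0, -2.0)
points = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.5), (3.0, -1.0, 1.5)]
loss = plane_alignment_loss(points, plane)
```

When the normal has unit length, each residual is the distance from the point to the plane, so the loss is zero only if every point lies on it.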

📊 Experimental Results

1. Localization Accuracy (KITTI Dataset)

On the KITTI dataset, NALO-VOM outperformed state-of-the-art methods (like ORB-SLAM2 and DSO), achieving the lowest translation error on 8 out of 11 sequences.

Each entry is Trans (%) / Rot (deg/m).

| Seq | ORB-SLAM2 [9] | DSO [3] | Song et al. [40] | Wang et al. [19] | Ours |
|---|---|---|---|---|---|
| 00 | 28.84 / 0.1982 | 29.78 / 0.2023 | 2.04 / 0.0048 | 1.01 / 0.0014 | 1.19 / 0.0028 |
| 01 | * | 1.79 / 0.0014 | - | - | 1.11 / 0.0009 |
| 02 | 2.63 / 0.0016 | 5.43 / 0.0038 | 1.50 / 0.0035 | 0.93 / 0.0018 | 1.91 / 0.0029 |
| 03 | 1.12 / 0.0020 | 0.79 / 0.0021 | 3.37 / 0.0021 | 0.52 / 0.0010 | 0.82 / 0.0021 |
| 04 | 2.25 / 0.0018 | 0.89 / 0.0021 | 2.19 / 0.0028 | 1.16 / 0.0023 | 0.85 / 0.0020 |
| 05 | 8.53 / 0.0055 | 7.80 / 0.0015 | 1.43 / 0.0038 | 1.45 / 0.0014 | 1.01 / 0.0015 |
| 06 | 18.21 / 0.0074 | 18.08 / 0.0019 | 2.09 / 0.0081 | 2.92 / 0.0027 | 1.33 / 0.0019 |
| 07 | 9.60 / 0.0121 | 7.52 / 0.0038 | - | 1.73 / 0.0023 | 1.59 / 0.0036 |
| 08 | 12.39 / 0.0032 | 8.36 / 0.0028 | 2.37 / 0.0044 | 1.18 / 0.0017 | 0.90 / 0.0021 |
| 09 | 16.64 / 0.0053 | 9.54 / 0.0019 | 1.76 / 0.0047 | 1.17 / 0.0020 | 1.02 / 0.0022 |
| 10 | 5.07 / 0.0063 | 5.49 / 0.0034 | 2.12 / 0.0085 | 0.93 / 0.0029 | 0.85 / 0.0025 |
| Avg | 10.53 / 0.0243 | 8.68 / 0.0206 | 2.10 / 0.0047 | 1.25 / 0.0020 | 1.14 / 0.0022 |

(Caption: Trajectory comparison on KITTI sequences 00, 02, 05, and 08. NALO-VOM (red) stays closest to the ground truth (black).)

2. Mapping Density & Quality

For objects that are critical to driving, the maps built by NALO-VOM have much higher point densities than those built by baseline methods.

Standard DSO Mapping (Sparse)

NALO-VOM Mapping (Ours, Semi-dense)

(Caption: Comparison of mapping density in a real-world scenario. NALO-VOM produces a much denser, semi-dense point cloud suitable for navigation.)

3. Navigation Applicability

We projected the semi-dense 3D point cloud into a 2D navigation grid map. Using the Hybrid A* algorithm, we showed that the resulting map supports valid path planning and obstacle avoidance for UGVs.
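
The projection step can be illustrated with a minimal occupancy-grid sketch. Everything specific here is an assumption made for illustration: points are `(x, y, z)` tuples with `z` as height, a cell is marked occupied if any point in the height band `[z_min, z_max]` falls into it, and the function name `points_to_grid` and all parameter values are hypothetical. The Hybrid A* planner itself is not shown.

```python
import math


def points_to_grid(points, resolution, x_range, y_range, z_min, z_max):
    """Mark grid cells that contain at least one point inside the height band."""
    cols = int(math.ceil((x_range[1] - x_range[0]) / resolution))
    rows = int(math.ceil((y_range[1] - y_range[0]) / resolution))
    grid = [[0] * cols for _ in range(rows)]
    for x, y, z in points:
        if not (z_min <= z <= z_max):
            continue  # skip points outside the band, e.g. the ground itself
        # floor (not int truncation) so points just below the lower bound
        # get a negative index and are rejected by the bounds check.
        col = math.floor((x - x_range[0]) / resolution)
        row = math.floor((y - y_range[0]) / resolution)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] = 1
    return grid


# Made-up points: one obstacle, one ground point, one outside the map.
cloud = [(0.6, 1.2, 0.3), (0.6, 1.2, 0.0), (-0.2, 0.1, 0.5)]
grid = points_to_grid(cloud, 0.5, (0.0, 2.0), (0.0, 2.0), 0.1, 1.5)
```

A denser point cloud fills more obstacle cells, which is why map density matters for this step: sparse maps leave gaps that a planner would treat as free space.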

Fig 7: 2D Grid Map Comparison

Fig 8: Hybrid A* Path Planning

4. Real-World Deployment

The system was successfully deployed on real-world UGV platforms, running at ~14 FPS on an NVIDIA RTX 3070, demonstrating excellent generalization and real-time capability.

(a) Scout 2.0 Platform & Test Area

(b) Pioneer P3-DX Platform & Test Area

We evaluated NALO-VOM in various outdoor environments. The system maintained stable tracking and produced high-quality maps even in challenging conditions.

(Caption: Comparison of real-world mapping results in different campus scenarios.)