NALO-VOM: Navigation-Oriented LiDAR-Guided Monocular Visual Odometry and Mapping

📌 Overview

Traditional monocular visual odometry (VO) produces sparse environment maps that often lack the structural detail needed for autonomous vehicle navigation. To address this, we propose NALO-VOM, a system that transfers the geometric structural knowledge of 3D LiDAR to a monocular VO framework via an offline-trained major-plane prediction network. Because the network is trained on non-artificial projection labels, a single camera can predict major-plane masks (MP-Masks) in real time. This integration enables scale-consistent camera pose estimation and the reconstruction of semi-dense maps that are dense enough to be converted into 2D grid maps for motion planning. Experiments on the KITTI dataset and on real-world platforms show that NALO-VOM achieves the lowest average translation error among the compared methods and provides a reliable map for obstacle avoidance and decision-making in UGV navigation.

🚀 Key Contributions

LiDAR-to-Camera Knowledge Transfer: We introduce a method that transfers the structural representation ability of 3D LiDAR to a monocular VO system through a major-plane prediction network.
Major-Plane Mask (MP-Mask) Integration: The network predicts an MP-Mask for each image frame to assist in dense front-end tracking and ground plane extraction.
Scale Optimization: By leveraging the extracted ground plane, the system significantly constrains scale drift, a common issue in long-term monocular VO.
Semi-Dense Mapping: Major planes are incorporated to build a high-quality map capable of supporting UGV motion planning and decision-making.

🛠️ System Architecture & Methodology

(Caption: The system architecture of NALO-VOM.)

Our framework consists of two main phases: an offline LiDAR-guided training process and an online monocular odometry process.

1. Offline Major-Plane Prediction Network

We utilize a ResNet-50-based architecture to predict major planes (e.g., ground, walls) that possess uniform planar curvature and occupy large areas.
The network is trained offline using non-artificial projection labels generated by perfectly synchronized camera-LiDAR pairs.
The training loss $E(g)$ is formulated as a sum of the variance and a weighted squared mean of the error in log space: $E(g)=\frac{1}{c}\sum_{i=0}^{h}\sum_{j=0}^{w}g_{(i,j)}^{2}-\frac{\lambda}{c^{2}}(\sum_{i=0}^{h}\sum_{j=0}^{w}g_{(i,j)})^{2}$
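
As a minimal sketch, the loss above can be evaluated with NumPy as follows. It assumes that $g$ holds the per-pixel error in log space and that $c$ is the number of pixels that carry a label; the function name, the `valid` mask, and the default `lam=0.5` are illustrative choices, not values taken from NALO-VOM.

```python
import numpy as np

def scale_invariant_loss(g, valid=None, lam=0.5):
    """Evaluate E(g) = sum(g^2)/c - lam * (sum(g))^2 / c^2.

    g     : (h, w) array of per-pixel errors in log space.
    valid : optional boolean mask of labelled pixels (assumption).
    lam   : weight of the squared-mean term (0.5 is an arbitrary default).
    """
    g = np.asarray(g, dtype=np.float64)
    if valid is None:
        valid = np.ones(g.shape, dtype=bool)
    vals = g[valid]
    c = vals.size
    if c == 0:
        return 0.0  # no labelled pixels, nothing to penalize
    return float(np.sum(vals ** 2) / c - lam * np.sum(vals) ** 2 / c ** 2)
```

With `lam=1.0` the expression reduces to the variance of $g$, so a constant offset in log space (a global scale error) is not penalized; with `lam=0.0` it is the plain mean squared error.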

2. Online Dense Tracking & Scale Optimization

Dense Tracking: During the online phase, the system takes only RGB images as input. The network generates an MP-Mask, which supplies additional depth pixels for robust front-end tracking. For a point on a fitted plane parameterized by $\pi=[\vec{n}^{T},\sigma]^{T}$, the newly added depth $d_{p}^{*}$ is given by: $d_{p}^{*}=\vec{n}^{T}\cdot K^{-1}\cdot p\cdot\sigma^{-1}$
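
The plane-induced value can be checked numerically with a short NumPy sketch. It assumes the plane convention $\vec{n}^{T}X=\sigma$ for a 3D point $X$ in the camera frame and a pinhole intrinsic matrix $K$; under that assumption the expression evaluates to $1/z$ for the point where the viewing ray of pixel $p$ meets the plane, which is consistent with an inverse-depth parameterization. The function name and the example intrinsics are hypothetical.

```python
import numpy as np

def plane_induced_value(n, sigma, K, pixel):
    """Evaluate n^T K^{-1} p / sigma for the pixel p = (u, v)."""
    p_h = np.array([pixel[0], pixel[1], 1.0])   # homogeneous pixel
    ray = np.linalg.solve(K, p_h)               # K^{-1} p without forming the inverse
    return float(np.dot(n, ray) / sigma)

# Hypothetical setup: camera 1.5 m above a flat ground plane, y axis pointing down.
K_example = np.array([[500.0, 0.0, 320.0],
                      [0.0, 500.0, 240.0],
                      [0.0, 0.0, 1.0]])
n_ground = np.array([0.0, 1.0, 0.0])
d_star = plane_induced_value(n_ground, 1.5, K_example, (320, 340))
z = 1.0 / d_star                                # depth along the optical axis
```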
Photometric Optimization: The system minimizes the photometric error over a sliding window of length $W$. The objective is defined as: $\mathcal{L} = \arg\min_{\xi_r, \xi_t, \Theta_{id}} \sum_{r=1}^W \sum_{t=1}^W \sum_{p \in N_r} \omega_p \|I_t(p') - I_r(p)\|_{\gamma} \quad (3)$ where the weighting factor $\omega_p$ down-weights pixels with large image gradients: $\omega_p = \frac{m^2}{m^2 + \|\nabla I_r(p)\|_2^2} \quad (4)$
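
The weighting factor in Eq. (4) is simple to compute per pixel. The sketch below is only an illustration: it approximates $\nabla I_r$ with finite differences via `numpy.gradient`, and the constant `m=8.0` is a placeholder rather than a value reported for NALO-VOM.

```python
import numpy as np

def gradient_weights(image, m=8.0):
    """Per-pixel weight m^2 / (m^2 + ||grad I||^2), as in Eq. (4)."""
    img = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(img)          # finite-difference gradients along rows, columns
    grad_sq = gx ** 2 + gy ** 2        # squared gradient magnitude
    return m ** 2 / (m ** 2 + grad_sq)
```

Flat regions receive a weight of 1, and the weight falls toward 0 as the gradient magnitude grows relative to `m`.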
Projection Model: The projected point position $p'$ is derived via the transformation matrix: $P_t = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = R(d_p^{-1} K^{-1} p) + t \quad (5)$ $T_{tr} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \quad (6)$
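
Eq. (5) can be illustrated with a small NumPy function that back-projects a pixel with its inverse depth, applies the rigid transform, and re-projects it. The final division by $z$ is the standard pinhole projection and is an assumption about how $p'$ is obtained from $P_t$; the function name and example values are hypothetical.

```python
import numpy as np

def warp_pixel(pixel, inv_depth, K, R, t):
    """Map pixel p in the reference frame to p' in the target frame."""
    p_h = np.array([pixel[0], pixel[1], 1.0])
    P_r = np.linalg.solve(K, p_h) / inv_depth   # d_p^{-1} K^{-1} p
    P_t = R @ P_r + t                           # Eq. (5)
    uvw = K @ P_t
    return uvw[:2] / uvw[2], P_t                # pinhole projection (assumed)

# Hypothetical example: pure sideways translation of 0.5 units, point at depth 5.
K_example = np.array([[500.0, 0.0, 320.0],
                      [0.0, 500.0, 240.0],
                      [0.0, 0.0, 1.0]])
p_prime, P_t = warp_pixel((320, 240), 0.2, K_example, np.eye(3), np.array([0.5, 0.0, 0.0]))
```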
Depth Estimation & Uncertainty: The inverse depth $d_p$ is calculated using the matching point $u_m$ on the epipolar line: $d_p = \frac{x_{pc} - u_m z_{pc}}{u_m z_t - x_t} \quad (7)$ To account for noise, we define the search range $(d_p^{min}, d_p^{max})$ using the uncertainty $\sigma_{\lambda}$: $d_p^{min} = \frac{x_{pc} - (u_m + \sigma_{\lambda}) z_{pc}}{(u_m + \sigma_{\lambda}) z_t - x_t}, \quad d_p^{max} = \frac{x_{pc} - (u_m - \sigma_{\lambda}) z_{pc}}{(u_m - \sigma_{\lambda}) z_t - x_t} \quad (8, 9)$
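
The following sketch evaluates Eqs. (7) to (9) and checks them with a round trip. It assumes that $(x_{pc}, y_{pc}, z_{pc}) = R K^{-1} p$ is the rotated viewing ray, that $(x_t, y_t, z_t)$ are the components of $t$, and that $u_m$ is the normalized x-coordinate of the match in the target frame; under these assumptions Eq. (7) follows from Eq. (5). The function names are hypothetical, and the range is returned sorted because which bound is numerically smaller depends on the motion direction.

```python
import numpy as np

def inverse_depth_from_match(u_m, p_c, t):
    """Eq. (7): inverse depth from the matched normalized coordinate u_m."""
    x_pc, _, z_pc = p_c
    x_t, _, z_t = t
    return (x_pc - u_m * z_pc) / (u_m * z_t - x_t)

def inverse_depth_range(u_m, sigma, p_c, t):
    """Eqs. (8) and (9): bounds obtained by shifting u_m by +/- sigma."""
    a = inverse_depth_from_match(u_m + sigma, p_c, t)
    b = inverse_depth_from_match(u_m - sigma, p_c, t)
    return min(a, b), max(a, b)

# Round trip with made-up geometry: build a match from a known inverse depth.
p_c = np.array([0.1, 0.0, 1.0])       # rotated viewing ray (assumed R K^{-1} p)
t = np.array([-0.3, 0.0, 0.1])
true_inv_depth = 0.2
P_t = p_c / true_inv_depth + t        # Eq. (5) with the rotation already applied
u_m = P_t[0] / P_t[2]
recovered = inverse_depth_from_match(u_m, p_c, t)
d_lo, d_hi = inverse_depth_range(u_m, 1e-3, p_c, t)
```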
Scale Recovery: Assuming a constant vertical distance $h_g$ between the camera and the local ground, we impose a scale constraint $\rho_c = h_g / h_c$ to update the translation vectors and perform sliding window optimization.
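
The scale constraint can be sketched in a few lines. The snippet assumes that $h_c$ is the camera height measured in the unscaled VO frame (for example from the fitted ground plane) and that $h_g$ is the known metric mounting height; the function name and numbers are illustrative, and the subsequent sliding-window optimization is not shown.

```python
import numpy as np

def apply_scale_constraint(translations, h_g, h_c):
    """Rescale translation vectors by rho_c = h_g / h_c."""
    if h_c <= 0:
        raise ValueError("estimated camera height must be positive")
    rho_c = h_g / h_c
    scaled = [rho_c * np.asarray(t, dtype=np.float64) for t in translations]
    return rho_c, scaled

# Hypothetical values: true height 1.65 m, estimated height 0.55 in VO units.
rho_c, scaled = apply_scale_constraint([[0.0, 0.0, 1.0], [0.1, 0.0, 2.0]], 1.65, 0.55)
```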

3. Semi-Dense Map Reconstruction

To reconstruct texture-less areas without relying on computationally heavy pixel-wise depth networks, we introduce major-planes into the map building process.
The major-planes are aligned to the nearest planes in the sparse point cloud map using the following loss function: $\mathcal{L}_{k}=\sum_{p\in{}^{n}\pi_{k}}\frac{\|{}^{n}\pi_{k}^{DSO^{T}}\cdot p\|_{2}}{nN_{k}}$
This combination of sparse feature points and major-planes yields a semi-dense environment map that accurately represents the environment’s geometry.
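
A point-to-plane residual of this kind can be sketched as below. The snippet makes several assumptions that are not spelled out above: the plane is a 4-vector $\pi$ with $\pi^{T}[X;1]=0$ for points on the plane, its normal has unit length, $p$ is the homogeneous 3D point, and the denominator is taken to be the number of points on the predicted plane. The function name is hypothetical.

```python
import numpy as np

def plane_alignment_loss(points, plane):
    """Mean absolute value of pi^T [X; 1] over the given 3D points."""
    pts = np.asarray(points, dtype=np.float64)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])      # homogeneous coordinates
    residuals = np.abs(pts_h @ np.asarray(plane, dtype=np.float64))
    return float(residuals.sum() / len(pts))

# Hypothetical check against the plane z = 2.
plane_z2 = np.array([0.0, 0.0, 1.0, -2.0])
on_plane = np.array([[0.0, 0.0, 2.0], [1.0, -3.0, 2.0]])
off_plane = np.array([[0.0, 0.0, 2.5], [1.0, 1.0, 1.5]])
```

With a unit normal the loss is the mean point-to-plane distance, so it is zero when the predicted plane coincides with the plane in the sparse map.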

📊 Experimental Results

1. Localization Accuracy (KITTI Dataset)

On the KITTI dataset, NALO-VOM achieves the lowest translation error on 8 of the 11 sequences and the lowest average translation error (1.14%) among the compared methods, which include ORB-SLAM2 and DSO.

Translation error (%) / rotation error (deg/m) per KITTI sequence:
Seq	ORB-SLAM2 [9]	DSO [3]	Song et al. [40]	Wang et al. [19]	Ours
00	28.84 / 0.1982	29.78 / 0.2023	2.04 / 0.0048	1.01 / 0.0014	1.19 / 0.0028
01	*	1.79 / 0.0014	-	-	1.11 / 0.0009
02	2.63 / 0.0016	5.43 / 0.0038	1.50 / 0.0035	0.93 / 0.0018	1.91 / 0.0029
03	1.12 / 0.0020	0.79 / 0.0021	3.37 / 0.0021	0.52 / 0.0010	0.82 / 0.0021
04	2.25 / 0.0018	0.89 / 0.0021	2.19 / 0.0028	1.16 / 0.0023	0.85 / 0.0020
05	8.53 / 0.0055	7.80 / 0.0015	1.43 / 0.0038	1.45 / 0.0014	1.01 / 0.0015
06	18.21 / 0.0074	18.08 / 0.0019	2.09 / 0.0081	2.92 / 0.0027	1.33 / 0.0019
07	9.60 / 0.0121	7.52 / 0.0038	-	1.73 / 0.0023	1.59 / 0.0036
08	12.39 / 0.0032	8.36 / 0.0028	2.37 / 0.0044	1.18 / 0.0017	0.90 / 0.0021
09	16.64 / 0.0053	9.54 / 0.0019	1.76 / 0.0047	1.17 / 0.0020	1.02 / 0.0022
10	5.07 / 0.0063	5.49 / 0.0034	2.12 / 0.0085	0.93 / 0.0029	0.85 / 0.0025
Avg	10.53 / 0.0243	8.68 / 0.0206	2.10 / 0.0047	1.25 / 0.0020	1.14 / 0.0022

(Caption: Trajectory comparison on KITTI sequences 00, 02, 05, and 08. NALO-VOM (red) stays closest to the ground truth (black).)

2. Mapping Density & Quality

The point densities of critical driving-aware objects built by NALO-VOM greatly exceed those built by baseline methods.

Standard DSO Mapping (Sparse)

NALO-VOM Mapping (Ours, Semi-dense)

(Caption: Comparison of mapping density in a real-world scenario. NALO-VOM provides a much denser semi-dense point cloud suitable for navigation.)

We projected the semi-dense 3D point cloud into a 2D navigation grid map. Using the Hybrid A* algorithm, we show that the map supports valid path planning and obstacle avoidance for UGVs.
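
One simple way to obtain such a grid is to keep points within an obstacle height band and mark the cells they fall into. The sketch below is not the NALO-VOM implementation: the axis convention (z up), the cell size, the height band, and the `min_hits` threshold are all assumptions, and the Hybrid A* planner that would consume the grid is not shown.

```python
import numpy as np

def pointcloud_to_grid(points, resolution=0.2, z_min=0.2, z_max=2.0, min_hits=1):
    """Project an (N, 3) point cloud to a 2D occupancy grid (1 = occupied)."""
    pts = np.asarray(points, dtype=np.float64)
    if pts.size == 0:
        raise ValueError("point cloud is empty")
    origin = pts[:, :2].min(axis=0)                       # grid corner in map coordinates
    extent = pts[:, :2].max(axis=0)
    shape = tuple(np.floor((extent - origin) / resolution).astype(int) + 1)
    counts = np.zeros(shape, dtype=np.int32)
    # Points inside the height band are treated as obstacles (assumed band).
    obstacles = pts[(pts[:, 2] > z_min) & (pts[:, 2] < z_max)]
    idx = np.floor((obstacles[:, :2] - origin) / resolution).astype(int)
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)          # accumulate hits per cell
    return (counts >= min_hits).astype(np.uint8), origin

# Hypothetical cloud: two ground points and one obstacle point.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.05], [0.5, 0.5, 1.0]])
grid, origin = pointcloud_to_grid(cloud, resolution=0.5)
```

Raising `min_hits` is one way to suppress isolated noisy points before planning, at the cost of possibly missing thin obstacles.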

Fig 7: 2D Grid Map Comparison

Fig 8: Hybrid A* Path Planning

4. Real-World Deployment

The system was successfully deployed on real-world UGV platforms, running at ~14 FPS on an NVIDIA RTX 3070, demonstrating excellent generalization and real-time capability.

(a) Scout 2.0 Platform & Test Area

(b) Pioneer P3-DX Platform & Test Area

We evaluated NALO-VOM in various outdoor environments. The system maintained stable tracking and produced high-quality maps even in challenging conditions.

Comparison of real-world mapping results in different campus scenarios.

Ziqi(Gabriel) Hu