Constructing accurate High-Definition (HD) maps is crucial for the safety of autonomous driving systems. HD maps provide comprehensive environmental information, such as road boundaries, lane dividers, and pedestrian crossings, for perception, prediction, and planning. A vectorized HD map consists of multiple map elements, each corresponding to a symbol on the road, such as a divider line or a pedestrian crossing area. Each vectorized map element is usually represented as a finite set of discrete points. Vectorized HD map construction aims at classifying and localizing the map elements in Bird's-Eye-View (BEV) space. The reconstruction results contain the class and point coordinates of each element, cf. Fig. 1.
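To make this representation concrete, here is a minimal Python sketch of what a vectorized map element boils down to: a class label plus an ordered array of BEV point coordinates. The structure and field names are our own illustration, not the paper's data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MapElement:
    """One vectorized map element in BEV space (illustrative structure only)."""
    cls: str              # element class, e.g. "divider", "boundary", "ped_crossing"
    points: np.ndarray    # shape (P, 2): P ordered (x, y) coordinates in BEV space


# Example: a straight lane divider sampled at 20 evenly spaced points.
divider = MapElement(
    cls="divider",
    points=np.stack([np.zeros(20), np.linspace(-15.0, 15.0, 20)], axis=1),
)
assert divider.points.shape == (20, 2)
```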
Figure 1. Examples of previous failures and our improved results. Best viewed in color.
Previous works mainly focus on point-level representation learning. VectorMapNet [1] introduces a keypoint representation to describe the outline of map elements and explores a coarse-to-fine two-stage framework. MapTR [2] proposes permutation-equivalent modeling of the point set and utilizes a deformable decoder to directly regress the point coordinates of elements. MapTRv2 [3] further incorporates dense supervision on both BEV and perspective views and a one-to-many matching strategy to improve accuracy. However, such a pipeline limits the model's capability to learn element-level information and correlations. As shown in the first row of Fig. 1(a), the corner detail of the road boundary is missing due to the inaccurate positions of some points. In (b) and (c), the length and direction of the elements are inaccurate due to missing overall information. In (d), lane dividers 1 and 2 are intertwined because of the similar point-level features of the dividers. Based on these observations, we argue for the importance of learning element-level information.
To better learn and exchange information about map elements, we propose a simple yet effective HybrId framework named HIMap based on hybrid representation learning. We first introduce a hybrid representation called HIQuery to represent all map elements in the map. It is a set of learnable parameters that can be iteratively updated and refined by interacting with BEV features. We then design a multi-layer hybrid decoder to encode hybrid information of map elements (e.g. point position, element shape) into HIQuery and perform point-element interaction. Each layer of the hybrid decoder comprises a point-element interactor, a self-attention module, and an FFN. Inside the point-element interactor, a mutual interaction mechanism realizes the exchange of point-level and element-level information and avoids the learning bias of single-level information. Finally, the output point-element integrated HIQuery can be directly converted into elements' point coordinates, classes, and masks. Furthermore, we propose a point-element consistency constraint to strengthen the consistency between point-level and element-level information. We conduct extensive experiments and consistently outperform previous methods on both the nuScenes and Argoverse2 datasets. Notably, our method achieves 77.8 mAP on the nuScenes dataset, surpassing previous SOTA methods by at least 8.3 mAP.
Figure 2. Overview of HIMap. Top: The pipeline of HIMap. Bottom: Detailed process of the point-element interactor and the point-element consistency. Best viewed in color.
The overall pipeline of HIMap is presented in Fig. 2(a).
Input. HIMap is compatible with various onboard sensor data, e.g. RGB images from multi-view cameras, point clouds from LiDAR, or multi-modality data. Here we take multi-view RGB images as an example to illustrate HIMap.
BEV Feature Extractor. We extract BEV features from the multi-view RGB images with the BEV feature extractor. It consists of a backbone to extract multi-scale 2D features from each perspective view, an FPN to refine and fuse the multi-scale features into single-scale features, and a 2D-to-BEV feature transformation module to map the 2D features into BEV features. The BEV features can be denoted as X ∈ ℝ^{H×W×C}, where H, W, and C refer to the spatial height, spatial width, and number of channels of the feature maps, respectively.
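As a shape-level sketch, the extractor pipeline can be traced in PyTorch as below. A single convolution stands in for the backbone + FPN and a bilinear resize for the 2D-to-BEV transform, so every module here is a placeholder of our own, kept only to make the tensor shapes concrete; it is not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class BEVFeatureExtractor(nn.Module):
    """Shape-level sketch: multi-view images -> BEV features X ∈ ℝ^{H×W×C}."""
    def __init__(self, C=256, H=100, W=50):
        super().__init__()
        self.C = C
        # Placeholders: a single conv stands in for the backbone + FPN, and a
        # bilinear resize stands in for the 2D-to-BEV transform (assumptions).
        self.backbone_fpn = nn.Conv2d(3, C, kernel_size=8, stride=8)
        self.to_bev = nn.Upsample(size=(H, W), mode="bilinear", align_corners=False)

    def forward(self, images):                     # images: (B, N_views, 3, h, w)
        B, N, _, h, w = images.shape
        feats = self.backbone_fpn(images.flatten(0, 1))           # (B*N, C, h/8, w/8)
        feats = feats.view(B, N, self.C, h // 8, w // 8).mean(1)  # toy multi-view fusion
        X = self.to_bev(feats)                                    # (B, C, H, W)
        return X.permute(0, 2, 3, 1)               # (B, H, W, C), matching X ∈ ℝ^{H×W×C}


X = BEVFeatureExtractor()(torch.randn(2, 6, 3, 256, 512))
print(X.shape)  # torch.Size([2, 100, 50, 256])
```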
HIQuery. To sufficiently learn both point-level and element-level information of map elements, we introduce HIQuery to represent all elements in the map. HIQuery is a set of learnable parameters Q^h ∈ ℝ^{E×(P+1)×C}, where E, P, and C denote the maximum number of map elements (e.g. 50), the number of points in an element (e.g. 20), and the number of channels, respectively. Inside HIQuery, Q_i^h ∈ ℝ^{(P+1)×C} is responsible for one map element with index i ∈ {1, …, E}. In particular, Q_i^h can be decomposed into two parts, a point query Q_i^p ∈ ℝ^{P×C} and an element query Q_i^e ∈ ℝ^{C}, corresponding to point-level and element-level information, respectively. With this point-element integrated information, HIQuery can be easily converted into the corresponding elements' point coordinates, classes, and masks.
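In code, HIQuery is just a learnable tensor plus a slicing convention. A minimal sketch follows; which of the P+1 slots holds the element query is our assumption (we take the last one).

```python
import torch
import torch.nn as nn

E, P, C = 50, 20, 256  # max elements, points per element, channels

# HIQuery: learnable parameters Q^h ∈ ℝ^{E×(P+1)×C}.
hi_query = nn.Parameter(torch.randn(E, P + 1, C))

# Decompose each Q_i^h into a point query Q_i^p ∈ ℝ^{P×C} and an element
# query Q_i^e ∈ ℝ^{C}. Which of the P+1 slots holds the element query is
# our assumption; we take the last one.
point_query = hi_query[:, :P, :]     # (E, P, C): point-level information
element_query = hi_query[:, P, :]    # (E, C):   element-level information
```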
Hybrid Decoder. The hybrid decoder produces the point-element integrated HIQuery by iteratively interacting the HIQuery Q^h with the BEV features X. It contains multiple layers, each comprising a point-element interactor, a self-attention module, a Feed-Forward Network (FFN), and multiple prediction heads. In each layer l ∈ {1, …, L}, where L is the total number of layers in the hybrid decoder, the point-element interactor first extracts, interacts, and encodes the point-level and element-level information of map elements into the input HIQuery Q^{h,l-1} ∈ ℝ^{E×(P+1)×C}. Then the self-attention and the FFN successively refine both levels of information in the HIQuery. The output point-element integrated HIQuery Q^{h,l} ∈ ℝ^{E×(P+1)×C} is forwarded to the class head, point head, and mask head to generate the elements' classes, point coordinates, and masks, respectively. In the training stage, we apply the point-element consistency constraint on the intermediate representations from the point and mask heads to enhance their consistency. The prediction results of the last layer are the final results of HIMap.
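One decoder layer can then be sketched as follows. The interactor stub, the head designs, and the attention configuration are our assumptions for illustration (a fuller interactor sketch follows its description below), and the mask head is omitted for brevity.

```python
import torch
import torch.nn as nn


class PointElementInteractor(nn.Module):
    """Stub standing in for the interactor; a fuller sketch is given below."""
    def __init__(self, C):
        super().__init__()
        self.proj = nn.Linear(C, C)

    def forward(self, q, bev):
        return q + self.proj(q)


class HybridDecoderLayer(nn.Module):
    """Sketch of one layer: interactor -> self-attention -> FFN -> heads."""
    def __init__(self, C=256, num_classes=3):
        super().__init__()
        self.interactor = PointElementInteractor(C)
        self.self_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(C, 4 * C), nn.ReLU(), nn.Linear(4 * C, C))
        self.class_head = nn.Linear(C, num_classes)  # reads the element query
        self.point_head = nn.Linear(C, 2)            # (x, y) per point query
        # The mask head is omitted here for brevity.

    def forward(self, q, bev):                       # q: (E, P+1, C), bev: (H*W, C)
        q = self.interactor(q, bev)                  # encode point/element information
        flat = q.reshape(1, -1, q.shape[-1])         # (1, E*(P+1), C)
        q = q + self.self_attn(flat, flat, flat)[0].reshape(q.shape)
        q = q + self.ffn(q)
        cls_logits = self.class_head(q[:, -1, :])         # (E, num_classes)
        points = self.point_head(q[:, :-1, :]).sigmoid()  # (E, P, 2), normalized coords
        return q, cls_logits, points


layer = HybridDecoderLayer()
q, cls_logits, points = layer(torch.randn(50, 21, 256), torch.randn(100 * 50, 256))
```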
The point-element interactor aims to interactively extract and encode both the point-level and element-level information of map elements into HIQuery. The motivation for interacting the two levels of information comes from their complementarity: the point-level information contains local position knowledge, while the element-level information provides overall shape and semantic knowledge. Hence the interaction enables mutual refinement of both the local and overall information of map elements.
As shown in Fig. 2(b), the point-element interactor consists of a point feature extractor, an element feature extractor, and a point-element hybrider. Given the BEV features X ∈ ℝ^{H×W×C} and the HIQuery Q^{h,l-1} ∈ ℝ^{E×(P+1)×C} generated by the (l-1)-th layer, we first decompose Q^{h,l-1} into a point query Q^{p,l-1} ∈ ℝ^{E×P×C} and an element query Q^{e,l-1} ∈ ℝ^{E×C}. Then we utilize the point and element feature extractors to extract the respective features from the BEV features and leverage the point-element hybrider to interact and encode the information into HIQuery. In this process, a mutual interaction mechanism is realized by sharing position embeddings when applying the two feature extractors and by utilizing the integrated information to update both levels of query inside the point-element hybrider.
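Under our reading of Fig. 2(b), the interactor can be sketched as below: cross-attention stands in for the point and element feature extractors, a shared position embedding conditions both, and concatenate-and-project layers stand in for the hybrider's update rule. All of these concrete choices are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class PointElementInteractor(nn.Module):
    """Sketch: extract point/element features from BEV and mutually refine them."""
    def __init__(self, C=256):
        super().__init__()
        # Cross-attention stands in for the point / element feature extractors.
        self.point_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
        self.elem_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
        # Point-element hybrider: each level is updated with integrated information.
        self.point_update = nn.Linear(2 * C, C)  # [point; its element] -> point
        self.elem_update = nn.Linear(2 * C, C)   # [element; pooled points] -> element

    def forward(self, q, bev, pos=None):         # q: (E, P+1, C), bev: (H*W, C)
        E, _, C = q.shape
        pq, eq = q[:, :-1, :], q[:, -1, :]       # point (E, P, C), element (E, C)
        # Shared position embedding: both extractors see identically positioned keys.
        kv = (bev if pos is None else bev + pos)[None].expand(E, -1, -1).contiguous()
        pq = pq + self.point_attn(pq, kv, kv)[0]                 # point-level extraction
        eq = eq + self.elem_attn(eq[:, None], kv, kv)[0][:, 0]   # element-level extraction
        # Mutual interaction: update each level with the other's information.
        pq_new = self.point_update(torch.cat([pq, eq[:, None].expand_as(pq)], dim=-1))
        eq_new = self.elem_update(torch.cat([eq, pq.mean(dim=1)], dim=-1))
        return torch.cat([pq_new, eq_new[:, None]], dim=1)       # (E, P+1, C)


out = PointElementInteractor()(torch.randn(50, 21, 256), torch.randn(100 * 50, 256))
```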
Considering the primitive differences between point-level and element-level representations, which focus on local and overall information respectively, the learning of the two levels of representations may interfere with each other. This increases the difficulty and reduces the effectiveness of the information interaction. Therefore, we introduce the point-element consistency constraint to enhance the consistency between the point-level and element-level information of each element. As a byproduct, the distinguishability of elements is also strengthened.
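The constraint is applied to intermediate representations from the point and mask heads. Since its exact form is not spelled out above, the sketch below uses a cosine-similarity penalty between a pooled point-level embedding and the element-level embedding as one plausible instantiation; this loss form is purely our assumption.

```python
import torch
import torch.nn.functional as F


def point_element_consistency_loss(point_feats, elem_feats):
    """Hypothetical consistency loss: pull each element's pooled point-level
    embedding toward its element-level embedding.

    point_feats: (E, P, C) intermediate features tied to the point head
    elem_feats:  (E, C)    intermediate features tied to the mask head
    """
    pooled = point_feats.mean(dim=1)                       # (E, C) summary from points
    cos = F.cosine_similarity(pooled, elem_feats, dim=-1)  # (E,) agreement per element
    return (1.0 - cos).mean()                              # zero when both levels agree


loss = point_element_consistency_loss(torch.randn(50, 20, 256), torch.randn(50, 256))
```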
Table 1 presents the comparison of results on the nuScenes dataset with multi-view RGB images as input. Our HIMap achieves new state-of-the-art performance (73.7 and 51.6 mAP under the easy and hard settings, respectively).
We also achieve state-of-the-art performance under additional settings (e.g. multi-modality inputs) and on additional datasets (e.g. Argoverse2). Please refer to our CVPR paper for more details.
Table 1. Comparison to the state of the art on nuScenes val. The best results with the same backbone are in bold and the second best are underlined. Gains are calculated between the best and second-best results.
In this paper, we introduce a simple yet effective HybrId framework (i.e. HIMap) based on hybrid representation learning for end-to-end vectorized HD map construction. In HIMap, we introduce HIQuery to represent all map elements, a point-element interactor to interactively extract and encode both point-level and element-level information into HIQuery, and a point-element consistency constraint to strengthen the consistency between the two levels of information. With the above designs, HIMap achieves new SOTA performance on both the nuScenes and Argoverse2 datasets.
[1] Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. VectorMapNet: End-to-end vectorized HD map learning. In International Conference on Machine Learning, pages 22352–22369. PMLR, 2023.
[2] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. MapTR: Structured modeling and learning for online vectorized HD map construction. arXiv preprint arXiv:2208.14437, 2022.
[3] Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. MapTRv2: An end-to-end framework for online vectorized HD map construction. arXiv preprint arXiv:2308.05736, 2023.