MapTrace: Scalable Data Generation for Route Tracing on Maps

Abstract

While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited.

Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations.

To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities.

Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points while also reducing path-tracing error, measured by Normalized Dynamic Time Warping (NDTW). These gains show that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.

Synthetic Data Generation Overview

MapTrace synthetic data generation pipeline. An LLM generates a variety of map descriptions, which a text-to-image model renders into images. Candidate path masks are extracted and filtered by a Mask Critic. Valid masks are converted into a pixel graph to compute shortest-path candidates, which a Path Critic then judges for quality and traversability.
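To make the pixel-graph step concrete, here is a minimal sketch of how a shortest-path candidate could be computed from a binary path mask. It is an illustration under stated assumptions, not the pipeline's actual implementation: the function name `shortest_pixel_path`, the 4-connected neighbourhood, and the use of breadth-first search with unit edge costs are all hypothetical choices.

```python
from collections import deque

def shortest_pixel_path(mask, start, goal):
    """Breadth-first search over traversable pixels.

    Assumptions (not taken from the MapTrace pipeline): `mask` is a 2D grid
    where truthy cells are walkable, pixels are 4-connected, and every step
    has unit cost. Returns a list of (row, col) cells, or None if no path exists.
    """
    rows, cols = len(mask), len(mask[0])
    if not (mask[start[0]][start[1]] and mask[goal[0]][goal[1]]):
        return None  # an endpoint lies outside the walkable mask
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and mask[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is not reachable through walkable pixels

# Toy mask: 1 = walkable path pixel, 0 = obstacle.
mask = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
path = shortest_pixel_path(mask, (0, 0), (3, 0))
```

On this toy mask the only walkable route runs along the top row, down the third column, and back along the third row, so `path` contains eight cells. A real mask would be far larger, and a weighted search (for example Dijkstra with diagonal moves) may be a better fit; that choice is left open here.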

How do models perform in path tracing on maps?

We evaluate leading multimodal models on MapBench to assess their ability to trace routes across diverse map types (e.g., Malls, Urban areas, Zoos). We utilize Normalized Dynamic Time Warping (NDTW) to measure geometric alignment (lower is better) and Success Rate (SR) to measure instruction adherence.
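As a rough illustration of the NDTW metric, the sketch below computes a dynamic-time-warping cost between a predicted and a reference path and normalizes it. The exact normalization used on MapBench is not given here, so dividing by the number of reference points is an assumption, as are the function names `dtw_cost` and `ndtw` and the use of Euclidean distance between points.

```python
import math

def dtw_cost(pred, ref):
    """Classic dynamic time warping cost between two point sequences."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] = best alignment cost of pred[:i] against ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])  # Euclidean distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance along the predicted path only
                cost[i][j - 1],      # advance along the reference path only
                cost[i - 1][j - 1],  # advance along both paths
            )
    return cost[n][m]

def ndtw(pred, ref):
    """DTW cost divided by the reference length (assumed normalization)."""
    return dtw_cost(pred, ref) / len(ref)

reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
exact = ndtw(reference, reference)
shifted = ndtw([(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)], reference)
```

A prediction that matches the reference exactly scores 0, while the same path shifted by one unit scores 1.0 under this normalization, consistent with lower values indicating better geometric alignment.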

How does our dataset improve performance?

Fine-tuning on our synthetic MapTrace dataset narrows the gap between general capability and precise spatial reasoning. For Gemini 2.5 Flash, fine-tuning reduces the average NDTW error on real maps from MapBench by 32.5% (1.29→0.87), surpassing the heuristic baseline and achieving state-of-the-art performance.

BibTeX

TBD