Jannik Malte Meissner

Back to the Future: Training VLMs for Document Understanding of Historic Government Records

https://pytorchconference.sched.com/event/28nTH

We presented a PyTorch-powered Vision-Language Model approach for extracting names and dates of birth as structured JSON from 2.47 million historic Saxon civil records (1850–1950). Starting with only 1,242 labeled examples, we combined continued pretraining of Llama 3.2-11B-Vision on ~20B domain-specific German tokens using TorchTitan, synthetic handwriting data generation, and iterative bootstrapping with human correction. The resulting model reaches a 7.99% character error rate (CER) while maintaining 99.9996% valid JSON outputs, outperforming conventional HTR+OCR methods and showing emergent generalization to layout variations and handwritten corrections.
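The two headline metrics can be made concrete with a short sketch: checking a model output for JSON validity against the expected name/date-of-birth fields, and computing character error rate (CER) as edit distance normalized by reference length. The field names and the sample record below are illustrative assumptions, not the actual extraction schema.

```python
import json

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edit distance divided by reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def is_valid_record(raw: str) -> bool:
    # Parse the model output and check for the expected fields
    # (field names here are hypothetical, not the schema used in the talk).
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and {"name", "date_of_birth"} <= record.keys()

output = '{"name": "Johann Schmidt", "date_of_birth": "1873-04-12"}'
print(is_valid_record(output))                           # True
print(round(cer("Johann Schmidt", "Johan Schmidt"), 3))  # 0.071
```

In practice, the "valid JSON" rate is the fraction of model outputs that pass a check like `is_valid_record`, and CER is averaged over transcribed fields against human references.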
