Recovering and Simulating Pedestrians in the Wild

In order to deploy autonomous vehicles in everyday life, it is necessary to verify that the system can detect pedestrians and adjust its trajectory accordingly. However, real-world experiments with pedestrians would be unethical, and the commonly used artist-designed human meshes require considerable manual effort.

Therefore, a recent paper suggests employing real-world sensor data captured by an autonomous driving fleet, recovering the 3D motion and shapes of pedestrians, and using them in simulations.

Photo credit: Florida Department of Transportation / NHTSA

LiDAR point clouds and camera images are used. The shape and pose of pedestrians are recovered with a combination of deep learning and energy minimization. A realistic LiDAR simulation system then poses and deforms a single artist-created mesh using the recovered real-world data, generating a large collection of meshes across different scenes. The experimental evaluation shows that training with the simulated data improves pedestrian detection performance.
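The core of simulating LiDAR returns from a posed mesh is casting rays against its triangles and keeping the closest hit per ray. The sketch below is an illustrative, self-contained version of that step using the standard Möller–Trumbore ray–triangle intersection; it is not the paper's implementation, and all function names are hypothetical.

```python
import numpy as np

def ray_triangle_hit(origin, direction, v0, v1, v2, eps=1e-8):
    """Möller–Trumbore ray–triangle intersection.
    Returns the hit distance t along the ray, or None on a miss."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = origin - v0
    u = (s @ p) * inv           # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = (direction @ q) * inv   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = (e2 @ q) * inv
    return t if t > eps else None

def simulate_lidar(origin, directions, vertices, faces):
    """Cast one ray per direction against a triangle mesh (brute force)
    and return the simulated LiDAR points (closest hits only)."""
    points = []
    for d in directions:
        best = None
        for f in faces:
            t = ray_triangle_hit(origin, d, *vertices[f])
            if t is not None and (best is None or t < best):
                best = t
        if best is not None:
            points.append(origin + best * d)
    return np.array(points)
```

A real simulator would use an acceleration structure (e.g. a BVH) instead of the brute-force loop over faces, and would also model the sensor's scan pattern and noise.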

Sensor simulation is a key component for testing the performance of self-driving vehicles and for data augmentation to better train perception systems. Typical approaches rely on artists to create both 3D assets and their animations to generate a new scenario. This, however, does not scale. In contrast, we propose to recover the shape and motion of pedestrians from sensor readings captured in the wild by a self-driving car driving around. Towards this goal, we formulate the problem as energy minimization in a deep structured model that exploits human shape priors, reprojection consistency with 2D poses extracted from images, and a ray-caster that encourages the reconstructed mesh to agree with the LiDAR readings. Importantly, we do not require any ground-truth 3D scans or 3D pose annotations. We then incorporate the reconstructed pedestrian assets bank in a realistic LiDAR simulation system by performing motion retargeting, and show that the simulated LiDAR data can be used to significantly reduce the amount of annotated real-world data required for visual perception tasks.
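The abstract describes an objective that combines a human shape prior, 2D reprojection consistency, and agreement with LiDAR readings. As a toy illustration of that structure only (the paper's actual model, terms, and weights are far more elaborate; every name, weight, and the "template body" below are assumptions), one can minimize a weighted sum of three analogous terms:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins: a fixed "template" of body points, a shape coefficient
# beta that scales it, and a translation t placing it in the scene.
TEMPLATE = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
FOCAL = 500.0  # toy pinhole focal length, in pixels

def model_points(params):
    t, beta = params[:3], params[3]
    return t + (1.0 + beta) * TEMPLATE

def energy(params, lidar_pts, keypoint_2d, w_prior=1.0, w_reproj=1e-4, w_lidar=1.0):
    pts = model_points(params)
    # Shape prior: penalize deviation from the mean shape (beta = 0).
    e_prior = params[3] ** 2
    # Reprojection: projected centroid should match the detected 2D keypoint.
    c = pts.mean(axis=0)
    proj = FOCAL * c[:2] / c[2]
    e_reproj = np.sum((proj - keypoint_2d) ** 2)
    # LiDAR: each observed point should lie close to the reconstructed shape.
    d = np.linalg.norm(lidar_pts[:, None, :] - pts[None, :, :], axis=-1)
    e_lidar = np.mean(d.min(axis=1) ** 2)
    return w_prior * e_prior + w_reproj * e_reproj + w_lidar * e_lidar

def fit(lidar_pts, keypoint_2d):
    x0 = np.array([0.0, 0.0, 5.0, 0.0])  # initialize in front of the camera
    res = minimize(energy, x0, args=(lidar_pts, keypoint_2d),
                   method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-12})
    return res.x
```

The paper additionally embeds this optimization in a deep structured model and, crucially, needs no ground-truth 3D scans or 3D pose annotations, since the supervision comes from the 2D poses and the raw LiDAR itself.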