Applying auxiliary supervised depth-assisted transformer and cross modal attention fusion in monocular 3D object detection