Text this: Generation-driven understanding of localized 3D scenes with a 3D diffusion model