Speech recognition using an English multimodal corpus with integrated image and depth information