The spoken language in a multimodal context: