Describe: Video object extraction and representation