An architecture for building multimodal knowledge graphs, with flexible feature extraction and dynamic concept generation across modalities.
Extract and fuse multimodal features for deep learning.
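
As a rough illustration of what fusing multimodal features can look like, the sketch below L2-normalizes one feature vector per modality and concatenates them (a simple early-fusion scheme). This is an assumption for illustration only, not necessarily how this project fuses features; the names `l2_normalize` and `fuse_features` and the sample vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(features: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate L2-normalized per-modality vectors in sorted key order.

    Sorting the modality names keeps the layout of the fused vector stable
    regardless of dictionary insertion order.
    """
    fused: List[float] = []
    for modality in sorted(features):
        fused.extend(l2_normalize(features[modality]))
    return fused


# Hypothetical hand-written vectors standing in for real extractor output.
example = {
    "image": [3.0, 4.0],
    "text": [0.0, 2.0, 0.0],
}
fused = fuse_features(example)
print(len(fused), fused)
```

Normalizing before concatenation keeps a modality with large raw magnitudes from dominating the fused vector; learned fusion (for example attention or gating) would replace `fuse_features` in a real model.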