New related work appears daily. More recent works can be found on the GitHub page (https://github.com/BradyFU/Awesome-Multimodal-Large ...
The PlantIF framework consists of image and text feature extractors, semantic space encoders, and a multimodal feature fusion module. Image and text feature extractors are used to present visual and ...
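The data flow through the three stages named above (feature extractors, semantic space encoders, fusion module) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PlantIF implementation: the class `FusionPipeline`, the toy extractors, the fixed projection matrices, and the use of plain concatenation for fusion are all hypothetical placeholders chosen to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Apply a weight matrix (one row per output dimension) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


@dataclass
class FusionPipeline:
    # All five components are hypothetical stand-ins, not PlantIF's modules.
    image_extractor: Callable[[Sequence[float]], Vector]
    text_extractor: Callable[[str], Vector]
    image_encoder: Callable[[Vector], Vector]   # image features -> semantic space
    text_encoder: Callable[[Vector], Vector]    # text features -> semantic space
    fuse: Callable[[Vector, Vector], Vector]    # multimodal feature fusion

    def forward(self, pixels: Sequence[float], text: str) -> Vector:
        # Stage 1: modality-specific feature extraction.
        img_feat = self.image_extractor(pixels)
        txt_feat = self.text_extractor(text)
        # Stage 2: map each modality into a semantic space.
        img_sem = self.image_encoder(img_feat)
        txt_sem = self.text_encoder(txt_feat)
        # Stage 3: combine the two semantic representations.
        return self.fuse(img_sem, txt_sem)


def toy_image_extractor(pixels: Sequence[float]) -> Vector:
    # Assumption: summarize an "image" by its mean and maximum intensity.
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_text_extractor(text: str) -> Vector:
    # Assumption: summarize text by token count and total character count.
    tokens = text.split()
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


pipeline = FusionPipeline(
    image_extractor=toy_image_extractor,
    text_extractor=toy_text_extractor,
    image_encoder=lambda f: linear([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], f),
    text_encoder=lambda f: linear([[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]], f),
    fuse=lambda a, b: a + b,  # list concatenation as a placeholder fusion
)

fused = pipeline.forward([0.0, 0.5, 1.0], "yellow leaf spots")
print(fused)
```

In a real system each callable would be a learned network, and the fusion step would typically be something richer than concatenation; the sketch only fixes the order in which the components are applied.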